Modeling AB411 Exceedances

Author

DebrisEase team

This is a report on the modeling of exceedances of AB411 at Goleta Beach and Carpinteria Beach, aiming to understand the influence of sediment deposition. The data used in this analysis is from the Santa Barbara County Flood Control District, the California State Water Resources Control Board, the USGS National Water Information System, and NOAA’s National Centers for Environmental Information. The data was filtered to 2010 - 2024. The analysis will focus on the relationship between sediment deposition, streamflow discharge, rainfall, and water quality exceedances. The goal is to determine if deposition events are associated with water quality (fecal indicator bacteria) exceedances, which have public health implications.

Setup

Code
library(tidyverse)
library(readxl)
library(lubridate)
library(here)
library(janitor)
library(tictoc)
library(ggplot2)
library(plotly) # for interactive plots
library(nortest) # for anderson-darling normality test
library(tsibble)
library(car) # for VIF

# ---------- data directories ---------
# raw data directory
raw_data_dir <- here("beach_nourishment", "wrcb_water_qual", "raw_data")

# intermediate data directory
int_dir <- here("beach_nourishment", "wrcb_water_qual", "intermediate")

# output data directory
output_dir <- here("beach_nourishment", "wrcb_water_qual", "output")

Wet Multivariate Logistic Regression Analysis

The “wet” models are really just multivariate logistic regression models that include both wet and dry days, meaning that the models are not just predicting exceedances on wet days. The models are predicting exceedances on all days, regardless of whether it rained or not. The models are predicting exceedances based on the predictors (streamflow discharge, rainfall, and sediment deposition) on the day of the exceedance.

Consolidate data

Streamflow Discharge

USGS Gauge data for San Jose Creek and Carpinteria Creek was used to determine the relationship between deposition events, rainfall, streamflow discharge, and water quality. The data was downloaded from the USGS National Water Information System https://waterdata.usgs.gov/nwis. The data was downloaded for the years 2000-2024. Discharge is provided in cubic feet per second (ft³/s).

Code
# install the dataRetrieval package from USGS
library(remotes)
library(curl)
install_github("DOI-USGS/dataRetrieval",
               build_vignettes = TRUE, 
               build_opts = c("--no-resave-data",
                              "--no-manual"))

# package citation
citation(package = "dataRetrieval")
To cite dataRetrieval in publications, please use:

  De Cicco, L.A., Hirsch, R.M., Lorenz, D., Watkins, W.D., Johnson, M.,
  2025, dataRetrieval: R packages for discovering and retrieving water
  data available from Federal hydrologic web services, v.2.7.18,
  doi:10.5066/P9X4L3GE

A BibTeX entry for LaTeX users is

  @Manual{,
    author = {Laura DeCicco and Robert Hirsch and David Lorenz and Jordan Read and Jordan Walker and Lindsay Platt and David Watkins and David Blodgett and Mike Johnson and Aliesha Krall and Lee Stanish},
    title = {dataRetrieval: R packages for discovering and retrieving water data available from U.S. federal hydrologic web services},
    publisher = {U.S. Geological Survey},
    address = {Reston, VA},
    version = {2.7.18},
    institution = {U.S. Geological Survey},
    year = {2025},
    doi = {10.5066/P9X4L3GE},
  }
Code
# load the dataRetrieval package
library(dataRetrieval)

whatNWISdata(siteNumber = "11120500")
   agency_cd  site_no              station_nm site_tp_cd dec_lat_va dec_long_va
1       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
2       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
3       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
4       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
5       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
6       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
7       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
8       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
9       USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
10      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
11      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
12      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
13      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
14      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
15      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
16      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
17      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
18      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
19      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
20      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
21      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
22      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
23      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
24      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
25      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
26      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
27      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
28      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
29      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
30      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
31      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
32      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
33      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
34      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
35      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
36      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
37      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
38      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
39      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
40      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
41      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
42      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
43      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
44      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
45      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
46      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
47      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
48      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
49      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
50      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
51      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
52      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
53      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
54      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
55      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
56      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
57      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
58      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
59      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
60      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
61      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
62      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
63      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
64      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
65      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
66      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
67      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
68      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
69      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
70      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
71      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
72      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
73      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
74      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
75      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
76      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
77      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
78      USGS 11120500 SAN JOSE C NR GOLETA CA         ST   34.45915    -119.809
   coord_acy_cd dec_coord_datum_cd alt_va alt_acy_va alt_datum_cd   huc_cd
1             F              NAD83  97.85       0.16       NAVD88 18060013
2             F              NAD83  97.85       0.16       NAVD88 18060013
3             F              NAD83  97.85       0.16       NAVD88 18060013
4             F              NAD83  97.85       0.16       NAVD88 18060013
5             F              NAD83  97.85       0.16       NAVD88 18060013
6             F              NAD83  97.85       0.16       NAVD88 18060013
7             F              NAD83  97.85       0.16       NAVD88 18060013
8             F              NAD83  97.85       0.16       NAVD88 18060013
9             F              NAD83  97.85       0.16       NAVD88 18060013
10            F              NAD83  97.85       0.16       NAVD88 18060013
11            F              NAD83  97.85       0.16       NAVD88 18060013
12            F              NAD83  97.85       0.16       NAVD88 18060013
13            F              NAD83  97.85       0.16       NAVD88 18060013
14            F              NAD83  97.85       0.16       NAVD88 18060013
15            F              NAD83  97.85       0.16       NAVD88 18060013
16            F              NAD83  97.85       0.16       NAVD88 18060013
17            F              NAD83  97.85       0.16       NAVD88 18060013
18            F              NAD83  97.85       0.16       NAVD88 18060013
19            F              NAD83  97.85       0.16       NAVD88 18060013
20            F              NAD83  97.85       0.16       NAVD88 18060013
21            F              NAD83  97.85       0.16       NAVD88 18060013
22            F              NAD83  97.85       0.16       NAVD88 18060013
23            F              NAD83  97.85       0.16       NAVD88 18060013
24            F              NAD83  97.85       0.16       NAVD88 18060013
25            F              NAD83  97.85       0.16       NAVD88 18060013
26            F              NAD83  97.85       0.16       NAVD88 18060013
27            F              NAD83  97.85       0.16       NAVD88 18060013
28            F              NAD83  97.85       0.16       NAVD88 18060013
29            F              NAD83  97.85       0.16       NAVD88 18060013
30            F              NAD83  97.85       0.16       NAVD88 18060013
31            F              NAD83  97.85       0.16       NAVD88 18060013
32            F              NAD83  97.85       0.16       NAVD88 18060013
33            F              NAD83  97.85       0.16       NAVD88 18060013
34            F              NAD83  97.85       0.16       NAVD88 18060013
35            F              NAD83  97.85       0.16       NAVD88 18060013
36            F              NAD83  97.85       0.16       NAVD88 18060013
37            F              NAD83  97.85       0.16       NAVD88 18060013
38            F              NAD83  97.85       0.16       NAVD88 18060013
39            F              NAD83  97.85       0.16       NAVD88 18060013
40            F              NAD83  97.85       0.16       NAVD88 18060013
41            F              NAD83  97.85       0.16       NAVD88 18060013
42            F              NAD83  97.85       0.16       NAVD88 18060013
43            F              NAD83  97.85       0.16       NAVD88 18060013
44            F              NAD83  97.85       0.16       NAVD88 18060013
45            F              NAD83  97.85       0.16       NAVD88 18060013
46            F              NAD83  97.85       0.16       NAVD88 18060013
47            F              NAD83  97.85       0.16       NAVD88 18060013
48            F              NAD83  97.85       0.16       NAVD88 18060013
49            F              NAD83  97.85       0.16       NAVD88 18060013
50            F              NAD83  97.85       0.16       NAVD88 18060013
51            F              NAD83  97.85       0.16       NAVD88 18060013
52            F              NAD83  97.85       0.16       NAVD88 18060013
53            F              NAD83  97.85       0.16       NAVD88 18060013
54            F              NAD83  97.85       0.16       NAVD88 18060013
55            F              NAD83  97.85       0.16       NAVD88 18060013
56            F              NAD83  97.85       0.16       NAVD88 18060013
57            F              NAD83  97.85       0.16       NAVD88 18060013
58            F              NAD83  97.85       0.16       NAVD88 18060013
59            F              NAD83  97.85       0.16       NAVD88 18060013
60            F              NAD83  97.85       0.16       NAVD88 18060013
61            F              NAD83  97.85       0.16       NAVD88 18060013
62            F              NAD83  97.85       0.16       NAVD88 18060013
63            F              NAD83  97.85       0.16       NAVD88 18060013
64            F              NAD83  97.85       0.16       NAVD88 18060013
65            F              NAD83  97.85       0.16       NAVD88 18060013
66            F              NAD83  97.85       0.16       NAVD88 18060013
67            F              NAD83  97.85       0.16       NAVD88 18060013
68            F              NAD83  97.85       0.16       NAVD88 18060013
69            F              NAD83  97.85       0.16       NAVD88 18060013
70            F              NAD83  97.85       0.16       NAVD88 18060013
71            F              NAD83  97.85       0.16       NAVD88 18060013
72            F              NAD83  97.85       0.16       NAVD88 18060013
73            F              NAD83  97.85       0.16       NAVD88 18060013
74            F              NAD83  97.85       0.16       NAVD88 18060013
75            F              NAD83  97.85       0.16       NAVD88 18060013
76            F              NAD83  97.85       0.16       NAVD88 18060013
77            F              NAD83  97.85       0.16       NAVD88 18060013
78            F              NAD83  97.85       0.16       NAVD88 18060013
   data_type_cd parm_cd stat_cd  ts_id loc_web_ds medium_grp_cd parm_grp_cd
1            ad    <NA>    <NA>      0         NA           wat        <NA>
2            dv   00060   00003   8464         NA           wat        <NA>
3            pk    <NA>    <NA>      0         NA           wat        <NA>
4            qw    <NA>    <NA>      0         NA           wat         ALL
5            qw   00001    <NA>      0         NA           wat         INF
6            qw   00009    <NA>      0         NA           wat         INF
7            qw   00010    <NA>      0         NA           wat         PHY
8            qw   00020    <NA>      0         NA           wat         PHY
9            qw   00025    <NA>      0         NA           wat         PHY
10           qw   00028    <NA>      0         NA           wat         INF
11           qw   00060    <NA>      0         NA           wat         PHY
12           qw   00061    <NA>      0         NA           wat         PHY
13           qw   00063    <NA>      0         NA           wat         INF
14           qw   00065    <NA>      0         NA           wat         PHY
15           qw   00094    <NA>      0         NA           wat         PHY
16           qw   00095    <NA>      0         NA           wat         PHY
17           qw   00191    <NA>      0         NA           wat         INN
18           qw   00300    <NA>      0         NA           wat         INN
19           qw   00301    <NA>      0         NA           wat         INN
20           qw   00400    <NA>      0         NA           wat         PHY
21           qw   00403    <NA>      0         NA           wat         PHY
22           qw   00405    <NA>      0         NA           wat         INN
23           qw   00410    <NA>      0         NA           wat         INN
24           qw   00419    <NA>      0         NA           wat         INN
25           qw   00447    <NA>      0         NA           wat         INN
26           qw   00450    <NA>      0         NA           wat         INN
27           qw   00608    <NA>      0         NA           wat         NUT
28           qw   00613    <NA>      0         NA           wat         NUT
29           qw   00618    <NA>      0         NA           wat         NUT
30           qw   00620    <NA>      0         NA           wat         NUT
31           qw   00631    <NA>      0         NA           wat         NUT
32           qw   00660    <NA>      0         NA           wat         NUT
33           qw   00671    <NA>      0         NA           wat         NUT
34           qw   00900    <NA>      0         NA           wat         PHY
35           qw   00902    <NA>      0         NA           wat         PHY
36           qw   00904    <NA>      0         NA           wat         PHY
37           qw   00915    <NA>      0         NA           wat         INM
38           qw   00925    <NA>      0         NA           wat         INM
39           qw   00930    <NA>      0         NA           wat         INM
40           qw   00931    <NA>      0         NA           wat         INM
41           qw   00932    <NA>      0         NA           wat         INM
42           qw   00933    <NA>      0         NA           wat         INM
43           qw   00935    <NA>      0         NA           wat         INM
44           qw   00940    <NA>      0         NA           wat         INN
45           qw   00945    <NA>      0         NA           wat         INN
46           qw   00950    <NA>      0         NA           wat         INN
47           qw   00955    <NA>      0         NA           wat         INN
48           qw   01020    <NA>      0         NA           wat         IMN
49           qw   01046    <NA>      0         NA           wat         IMM
50           qw   01056    <NA>      0         NA           wat         IMM
51           qw   30207    <NA>      0         NA           wat         PHY
52           qw   30208    <NA>      0         NA           wat         PHY
53           qw   30209    <NA>      0         NA           wat         PHY
54           qw   31625    <NA>      0         NA           wat         MBI
55           qw   39086    <NA>      0         NA           wat         INN
56           qw   70300    <NA>      0         NA           wat         PHY
57           qw   70301    <NA>      0         NA           wat         PHY
58           qw   70302    <NA>      0         NA           wat         PHY
59           qw   70303    <NA>      0         NA           wat         PHY
60           qw   71846    <NA>      0         NA           wat         NUT
61           qw   71850    <NA>      0         NA           wat         NUT
62           qw   71851    <NA>      0         NA           wat         NUT
63           qw   71856    <NA>      0         NA           wat         NUT
64           qw   80164    <NA>      0         NA           wat         SED
65           qw   80165    <NA>      0         NA           wat         SED
66           qw   80166    <NA>      0         NA           wat         SED
67           qw   80167    <NA>      0         NA           wat         SED
68           qw   80168    <NA>      0         NA           wat         SED
69           qw   80169    <NA>      0         NA           wat         SED
70           qw   80170    <NA>      0         NA           wat         SED
71           qw   82398    <NA>      0         NA           wat         INF
72           qw   90095    <NA>      0         NA           wat         PHY
73           qw   90410    <NA>      0         NA           wat         INN
74           qw   95902    <NA>      0         NA           wat         PHY
75           sv    <NA>    <NA>      0         NA           wat        <NA>
76           uv   00060    <NA>  14451         NA           wat        <NA>
77           uv   00065    <NA>  14452         NA           wat        <NA>
78           uv   63160    <NA> 346201         NA           wat        <NA>
     srs_id access_cd begin_date   end_date count_nu
1         0         0 2005-01-01 2024-01-01       20
2   1645423         0 1941-01-01 2025-03-01    30478
3         0         0 1941-04-04 2024-02-19       83
4         0         0 1977-08-09 1991-09-23      154
5         0         0 1986-06-04 1986-06-04        1
6         0         0 1987-08-03 1989-09-06        6
7   1645597         0 1977-08-09 1991-09-23      148
8   1645720         0 1983-09-07 1987-09-04        3
9   1641265         0 1990-02-27 1991-03-01        2
10        0         0 1977-08-09 1991-09-23      145
11  1645423         0 1977-08-09 1977-10-03        3
12  1645415         0 1977-10-03 1991-09-23      146
13        0         0 1985-02-26 1985-02-26        2
14 17164583         0 1983-05-26 1983-09-28        4
15  1646694         0 1980-10-01 1980-10-01        1
16  1646694         0 1977-08-09 1991-09-23      149
17  1640572         0 1977-08-09 1991-03-01      137
18   154302         0 1990-02-27 1991-03-01        2
19   154302         0 1990-02-27 1991-03-01        2
20 17028275         0 1977-08-09 1991-03-01      137
21  1644335         0 1980-10-01 1991-09-23      113
22    33548         0 1978-08-02 1991-03-01       13
23  1640192         0 1978-08-02 1991-03-01       10
24  1640192         0 1990-02-27 1991-03-01        2
25   119396         0 1987-03-11 1987-12-29        2
26     4788         0 1990-02-27 1991-03-01        2
27   966655         0 1991-03-01 1991-03-01        1
28   197194         0 1991-03-01 1991-03-01        1
29   197186         0 1991-03-01 1991-03-01        1
30   197186         0 1977-08-09 1977-08-09        1
31   701177         0 1978-08-02 1991-03-01       13
32   194464         0 1978-08-02 1991-03-01       13
33   194464         0 1978-08-02 1991-03-01       13
34  1640416         0 1977-08-09 1991-03-01       14
35  1640432         0 1980-01-03 1986-01-23        3
36  1640432         0 1989-01-12 1989-01-12        1
37   150268         0 1977-08-09 1991-03-01       14
38   149617         0 1977-08-09 1991-03-01       14
39   149856         0 1977-08-09 1991-03-01       14
40  1640523         0 1977-08-09 1991-03-01       14
41  1646660         0 1977-08-09 1991-03-01       14
42   967331         0 1980-01-03 1980-01-03        1
43   149740         0 1977-08-09 1991-03-01       14
44   205955         0 1977-08-09 1991-03-01       14
45   197301         0 1977-08-09 1991-03-01       14
46   206706         0 1978-08-02 1991-03-01       13
47   151977         0 1978-08-02 1991-03-01       13
48   150011         0 1977-08-09 1991-03-01       13
49   149559         0 1978-08-02 1991-03-01       13
50   149625         0 1978-08-02 1991-03-01       10
51 17164583         0 1983-05-26 1983-09-28        4
52        0         0 1977-08-09 1977-10-03        3
53  1645415         0 1977-10-03 1991-09-23      146
54   761692         0 1987-03-11 1987-03-11        1
55  1640192         0 1989-01-12 1989-01-12        1
56  1642222         0 1977-08-09 1991-09-23      140
57  1642222         0 1978-08-02 1991-03-01       13
58  1642222         0 1977-08-09 1991-09-23      144
59  1642222         0 1977-08-09 1991-09-23      150
60   966655         0 1991-03-01 1991-03-01        1
61   197186         0 1977-08-09 1977-08-09        1
62   197186         0 1991-03-01 1991-03-01        1
63   197194         0 1991-03-01 1991-03-01        1
64 17136284         0 1985-02-26 1985-02-26        1
65 17136284         0 1985-02-26 1985-02-26        1
66 17136284         0 1985-02-26 1985-02-26        2
67 17136284         0 1985-02-26 1985-02-26        2
68 17136284         0 1985-02-26 1985-02-26        2
69 17136284         0 1985-02-26 1985-02-26        2
70 17136284         0 1985-02-26 1985-02-26        1
71        0         0 1983-11-03 1991-09-23       73
72  1646694         0 1980-11-10 1991-09-23      112
73  1640192         0 1980-12-10 1991-03-01       10
74  1640432         0 1980-12-10 1986-01-23        3
75        0         0 1941-01-07 2025-01-03     1596
76  1645423         0 1988-10-01 2025-03-02    13301
77 17164583         0 2007-10-01 2025-03-02     6362
78  1642503         0 2022-10-03 2025-03-02      881
Code
# Goleta Beach: San Jose Creek
san_jose_creek_raw <- readNWISdv(
  siteNumber = "11120500",
  parameterCd = "00060",
  startDate = "2000-01-01",
  endDate = "2024-12-31",
  statCd = "00003"
) %>% 
  janitor::clean_names()

san_jose_creek_discharge <- san_jose_creek_raw %>% 
  mutate(date = as.Date(date)) %>% 
  select(date, sj_discharge = "x_00060_00003")
# write_csv(san_jose_creek_discharge, here(int_dir, "san_jose_creek_discharge.csv"))

# Carpinteria Beach: Carpinteria Creek
carp_creek_raw <- readNWISdv(
  siteNumber = "11119500",
  parameterCd = "00060",
  startDate = "2000-01-01",
  endDate = "2024-12-31",
  statCd = "00003"
) %>% 
  janitor::clean_names()

carp_creek_discharge <- carp_creek_raw %>%
  mutate(date = as.Date(date)) %>% 
  select(date, carp_discharge = "x_00060_00003")
# write_csv(carp_creek_discharge, here(int_dir, "carp_creek_discharge.csv"))

The Arroyo Burro Beach (control site) data was obtained from the Santa Barbara Long-Term Ecological Research (SBLTER) site for 2001 to 2018-09-30. The data was downloaded from the Environmental Data Initiative here: https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-sbc.3001.9. Discharge is provided in liters per second.

1 liter is about 0.03531470 cubic feet.

Code
# Arroyo Burro Beach (control site), from SBLTER not USGS
ab_creek_raw <- read.delim2(here(raw_data_dir, "sblter_arroyo_burro_discharge_2001_2018.txt"), header = TRUE, sep = ",", dec = ".") %>% 
  janitor::clean_names() %>% 
  select(timestamp_local, discharge_lps)

# separate the date from the time and convert to lubridate
ab_creek_discharge <- ab_creek_raw %>% 
  separate(timestamp_local, into = c("date", "time"), sep = "T") %>% 
  mutate(date = as.Date(date)) %>% 
  mutate(discharge_lps = ifelse(discharge_lps == c("-999"), NA, discharge_lps)) %>% 
  filter(date >= as.Date("2010-01-01"))# only keep the years 2010-2018

# sum the discharge for each day and convert to cubic feet per second
ab_creek_discharge_cfs <- ab_creek_discharge %>%
  group_by(date) %>%
  summarize(ab_discharge_lps = mean(discharge_lps)) %>% # average the discharge for each day for the mean daily discharge
  mutate(ab_discharge_cfs = ab_discharge_lps * 0.0353146667) # convert to cubic feet per second

Daily Precipitation

This is from county precipitation sites. The data was downloaded from the County of Santa Barbara Public Works Department https://rain.cosbpw.net/map/?sensor_class=10%7C1440&view=8bc6e88f-eeab-4281-9d92-3d723016e945. The data was downloaded for the years 2000-2024 for the UCSB (200) and Carpinteria Fire Station sites (208).

Code
# Goleta Beach: station 200
ucsb_precip_raw <- read_xls(here(raw_data_dir, "200dailys.xls")) %>% 
  clean_names()

ucsb_precip <- ucsb_precip_raw %>% 
  select(x3, x4, x5, x6) %>% 
  rename(year = x3, month = x4, day = x5, daily_rain = x6) %>%
  slice(-c(0:9)) %>%
  filter(year %in% 2000:2025) %>%
  mutate(date = make_date(year, month, day)) %>%
  select(date, daily_rain)
# write_csv(ucsb_precip, here(int_dir, "ucsb_precip.csv"))

####### using NOAA data
noaa_precip_gb <- read_csv(here(raw_data_dir, "noaa_gb_daily_summary_precip.csv")) %>% 
  janitor::clean_names() %>% 
  mutate(date = as.Date(date)) %>% 
  select(date, daily_rain = prcp) # rename into daily_rain

# Carpinteria Beach: station 208
carp_precip_raw <- read_xls(here(raw_data_dir, "208dailys.xls")) %>% 
  clean_names()

carp_precip <- carp_precip_raw %>%
  select(x3, x4, x5, x6) %>% 
  rename(year = x3, month = x4, day = x5, daily_rain = x6) %>%
  slice(-c(0:9)) %>%
  filter(year %in% 2000:2025) %>%
  mutate(date = make_date(year, month, day)) %>%
  select(date, daily_rain)
# write_csv(carp_precip, here(int_dir, "carp_precip.csv"))

Arroyo Burro Precipitation Evaluation

Code
# Arroyo Burro control site, using NOAA NCEI data:
## inland gauge site
noaa_precip_ab_inland <- read_csv(here(raw_data_dir, "noaa_arroyo_burro_inland_daily_summary_precip.csv")) %>% 
  janitor::clean_names() %>% 
  mutate(date = as.Date(date)) %>% 
  select(date, daily_rain = prcp) # rename into daily_rain

## east beach gauge site
noaa_precip_ab_east <- read_csv(here(raw_data_dir, "noaa_arroyo_burro_east_daily_summary_precip.csv")) %>% 
  janitor::clean_names() %>% 
  mutate(date = as.Date(date)) %>% 
  select(date, daily_rain = prcp) # rename into daily_rain

A t-test will be used to determine if the inland and east beach gauge sites are similar. The east beach gauge site has more data (2000-2025) compared to the inland gauge site (2016 - 2025).

Code
# test whether inland and east beach gauge is similar
plot_inland_and_east <- ggplot() +
  geom_point(data = noaa_precip_ab_inland, aes(x = date, y = daily_rain), color = "blue") +
  geom_line(data = noaa_precip_ab_inland, aes(x = date, y = daily_rain), color = "blue") + 
  geom_point(data = noaa_precip_ab_east, aes(x = date, y = daily_rain), color = "red") +
  geom_smooth(data = noaa_precip_ab_east, aes(x = date, y = daily_rain), color = "red") +
  labs(title = "Comparison of Precipitation at Arroyo Burro Beach Inland and East Beach Gauge Sites",
       x = "Date",
       y = "Daily Precipitation (inches)",
       color = "Gauge Site") +
  theme_minimal()
plot_inland_and_east

Code
# first, check for normality.  we will use the shapiro-wilks test for the inland data, and the anderson-darling test for the east beach data (which has more than 5000 observations).  See <https://www.researchgate.net/publication/267205556_Power_Comparisons_of_Shapiro-Wilk_Kolmogorov-Smirnov_Lilliefors_and_Anderson-Darling_Tests> for more information.

shapiro.test(noaa_precip_ab_inland$daily_rain) # W = 0.60304, p-value < 2.2e-16, so not normal

    Shapiro-Wilk normality test

data:  noaa_precip_ab_inland$daily_rain
W = 0.60304, p-value < 2.2e-16
Code
ad.test(noaa_precip_ab_east$daily_rain) # A = 1780.2, p-value < 2.2e-16

    Anderson-Darling normality test

data:  noaa_precip_ab_east$daily_rain
A = 1780.2, p-value < 2.2e-16
Code
## Q-Q plots
qqnorm(noaa_precip_ab_inland$daily_rain)
qqline(noaa_precip_ab_inland$daily_rain)

Code
qqnorm(noaa_precip_ab_east$daily_rain)
qqline(noaa_precip_ab_east$daily_rain)

Code
# since the data is not normal, we will use the wilcoxon rank sum test
wilcox.test(noaa_precip_ab_inland$daily_rain, noaa_precip_ab_east$daily_rain)

    Wilcoxon rank sum test with continuity correction

data:  noaa_precip_ab_inland$daily_rain and noaa_precip_ab_east$daily_rain
W = 1807158, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Code
#   Wilcoxon rank sum test with continuity correction
# 
# data:  noaa_precip_ab_inland$daily_rain and noaa_precip_ab_east$daily_rain
# W = 1807158, p-value < 2.2e-16
# alternative hypothesis: true location shift is not equal to 0

The two sites are significantly different. The east beach gauge site will be used for the Arroyo Burro Beach precipitation data, as it has many more observations (5537) versus the inland gauge site (392). Considering that the inland site only has data from 2016-2025, it is likely that the east beach gauge site is more representative of the entire time period of interest (2010-2025). Also, it is suspicious that for almost 10 years there are only 392 observations at the inland site, which is less than 1/10th of the number of observations at the east beach site.

Let’s now see if precipitation can be used without discharge data for years after 2018.

Code
# see if precipitation and discharge are correlated
ab_discharge_precip_data <- ab_creek_discharge_cfs %>% 
  left_join(noaa_precip_ab_east, by = "date")

cor(ab_discharge_precip_data$ab_discharge_cfs, ab_discharge_precip_data$daily_rain, use = "complete.obs") # 0.8446518, so they are correlated!!! Let's check further
[1] 0.8446518
Code
ggplot(ab_discharge_precip_data, aes(x = daily_rain, y = ab_discharge_cfs)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Daily Precipitation (inches)", y = "Discharge (cfs)", 
       title = "Arroyo Burro Discharge vs. Precipitation") +
  theme_minimal()

Code
ggplot(ab_discharge_precip_data, aes(x = date)) +
  geom_line(aes(y = ab_discharge_cfs, color = "Discharge")) +
  geom_line(aes(y = daily_rain * 100, color = "Precipitation")) +
  scale_y_continuous(sec.axis = sec_axis(~./100, name = "Precipitation (inches)")) +
  labs(y = "Discharge (cfs)", title = "Discharge and Precipitation Over Time") +
  scale_color_manual(values = c("Discharge" = "blue", "Precipitation" = "red"))

Code
# consider lagged correlations
lag_cor <- function(lag) {
  ab_discharge_precip_data %>%
    mutate(precip_lag = lag(daily_rain, n = lag)) %>%
    summarise(cor = cor(ab_discharge_cfs, precip_lag, use = "complete.obs")) %>%
    pull(cor)
}

lags <- 0:10
lag_cors <- map_dbl(lags, lag_cor)

data.frame(lag = lags, correlation = lag_cors) %>%
  ggplot(aes(x = lag, y = correlation)) +
  geom_line() +
  geom_point() +
  labs(title = "Lagged Correlation between Precipitation and Discharge")

Code
ab_discharge_precip_data %>%
  mutate(month = lubridate::month(date)) %>%
  group_by(month) %>%
  summarise(correlation = cor(ab_discharge_cfs, daily_rain, use = "complete.obs")) %>%
  ggplot(aes(x = month, y = correlation)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Monthly Correlation between Precipitation and Discharge")

Code
ab_discharge_precip_model <- lm(ab_discharge_cfs ~ daily_rain, data = ab_discharge_precip_data)
summary(ab_discharge_precip_model)

Call:
lm(formula = ab_discharge_cfs ~ daily_rain, data = ab_discharge_precip_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-69.175  -0.111   0.142   0.606 204.684 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.2361     0.1220   1.935    0.053 .  
daily_rain   48.7387     0.5560  87.653   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.686 on 3086 degrees of freedom
  (106 observations deleted due to missingness)
Multiple R-squared:  0.7134,    Adjusted R-squared:  0.7133 
F-statistic:  7683 on 1 and 3086 DF,  p-value: < 2.2e-16

The adjusted R-squared value is 0.7133, indicating that about 71.33% of the variance in discharge (ab_discharge_cfs) can be explained by daily rainfall. The F-statistic (7683) with a very low p-value (< 2.2e-16) suggests that the model is statistically significant.

Intercept (0.2361): When there’s no rain, the expected discharge is about 0.2361 cfs. However, this is not statistically significant (p = 0.053).

daily_rain (48.7387): For each inch of rain, the discharge is expected to increase by about 48.7387 cfs. This is highly statistically significant (p < 2e-16).

The residuals range from -69.175 to 204.684, with the median close to zero (0.142). This suggests the model fits reasonably well for most data points, but there are some large outliers. The very low p-value for the daily_rain coefficient (< 2e-16) indicates a strong statistical relationship between rainfall and discharge – BUT! It may still not be a good idea to use precipitation as a proxy for discharge, as the model is not perfect and there are likely other factors influencing discharge that are not captured by precipitation data……. need to run this by the team.

Sediment Deposition Binary Variable

Determined from the presence of sediment deposition at the beach. The data was collected by the Santa Barbara County Flood Control District. The data is available for the years 2010-2024.

Code
# Goleta Beach
goleta_beach_deposition <- read_csv(here(raw_data_dir, "binary_deposition_occurrence_gb.csv"))


# Carpinteria Beach
carp_beach_deposition <- read_csv(here(raw_data_dir, "binary_deposition_occurrence_carp_ash_ave.csv"))

# Arroyo Burro Beach (control site)
ab_beach_deposition <- read_csv(here(raw_data_dir, "binary_deposition_occurrence_arroyo_burro.csv"))

Water Quality Data

Data found from https://www.waterboards.ca.gov/water_issues/programs/beaches/beach_water_quality/, specifically https://www.waterboards.ca.gov/water_issues/programs/beaches/search_beach_mon.html. The data is from the California State Water Resources Control Board. The data is from the Santa Barbara County Beaches. The data is from 2000 to 2024, though data was downloaded from 2010 to 2024.

Code
# read in the data
all_beach_qual_raw <- read_csv(here(raw_data_dir, "beach_monitoring_wrcb-AllResultData.csv")) %>% 
  clean_names()

# Goleta Beach
goleta_beach_qual <- all_beach_qual_raw %>% 
  filter(station_description %in% "Goleta Beach")

# convert to lubridate
goleta_beach_qual_ymd <- goleta_beach_qual %>% 
  mutate(sample_date = mdy(sample_date)) %>% 
  select(date = sample_date, gb_parameter = parameter, gb_result = result)

# make binary variable for any exceedance of the state health standard: 104 MPN/100mL for enterococcus, 400 MPN/100mL for fecal coliforms, 400 MPN/100mL for ecoli, and 1000 MPN/100mL for total coliforms, and 0 if not exceeded
binary_water_qual_gb <- goleta_beach_qual_ymd %>% 
  mutate(enterococcus_exceed = ifelse(gb_parameter %in% "Enterococcus" & gb_result > 104, 1, 0),
         ecoli_exceed = ifelse(gb_parameter %in% "E. Coli" & gb_result > 400, 1, 0),
         fecal_coliforms_exceed = ifelse(gb_parameter %in% "Fecal Coliforms" & gb_result > 400, 1, 0),
         total_coliforms_exceed = ifelse(gb_parameter %in% "Total Coliforms" & gb_result > 1000, 1, 0),
         any_exceed = ifelse(enterococcus_exceed == 1 | ecoli_exceed == 1 | fecal_coliforms_exceed == 1 | total_coliforms_exceed == 1, 1, 0))
Code
# Carpinteria Beach
carpenteria_beach_qual <- all_beach_qual_raw %>% 
  filter(station_description %in% "Carpinteria State")

# convert to lubridate
carpenteria_beach_qual_ymd <- carpenteria_beach_qual %>% 
  mutate(sample_date = mdy(sample_date)) %>% 
  select(date = sample_date, carp_parameter = parameter, carp_result = result)

# make binary variable for any exceedance of the state health standard: 104 MPN/100mL for enterococcus, 400 MPN/100mL for fecal coliforms, 400 MPN/100mL for ecoli, and 1000 MPN/100mL for total coliforms, and 0 if not exceeded
binary_water_qual_carp <- carpenteria_beach_qual_ymd %>% 
  mutate(enterococcus_exceed = ifelse(carp_parameter %in% "Enterococcus" & carp_result > 104, 1, 0),
         ecoli_exceed = ifelse(carp_parameter %in% "E. Coli" & carp_result > 400, 1, 0),
         fecal_coliforms_exceed = ifelse(carp_parameter %in% "Fecal Coliforms" & carp_result > 400, 1, 0),
         total_coliforms_exceed = ifelse(carp_parameter %in% "Total Coliforms" & carp_result > 1000, 1, 0),
         any_exceed = ifelse(enterococcus_exceed == 1 | ecoli_exceed == 1 | fecal_coliforms_exceed == 1 | total_coliforms_exceed == 1, 1, 0))
Code
# Arroyo Burro Beach (control site)
ab_beach_qual_raw <- read_xlsx(here(raw_data_dir, "beach_monitoring_wrcb-arroyo_burro.xlsx")) %>% 
  clean_names()

# convert to lubridate
ab_beach_qual_ymd <- ab_beach_qual_raw %>% 
  mutate(sample_date = ymd(sample_date)) %>% 
  select(date = sample_date, ab_parameter = parameter, ab_result = result)

# make binary variable for any exceedance of the state health standard: 104 MPN/100mL for enterococcus, 400 MPN/100mL for fecal coliforms, 400 MPN/100mL for ecoli, and 1000 MPN/100mL for total coliforms, and 0 if not exceeded
binary_water_qual_ab <- ab_beach_qual_ymd %>% 
  mutate(enterococcus_exceed = ifelse(ab_parameter %in% "Enterococcus" & ab_result > 104, 1, 0),
         ecoli_exceed = ifelse(ab_parameter %in% "E. Coli" & ab_result > 400, 1, 0),
         fecal_coliforms_exceed = ifelse(ab_parameter %in% "Fecal Coliforms" & ab_result > 400, 1, 0),
         total_coliforms_exceed = ifelse(ab_parameter %in% "Total Coliforms" & ab_result > 1000, 1, 0),
         any_exceed = ifelse(enterococcus_exceed == 1 | ecoli_exceed == 1 | fecal_coliforms_exceed == 1 | total_coliforms_exceed == 1, 1, 0))

Merge Data

Question: should I drop NAs before modeling?

Code
# Goleta Beach
goleta_beach_data <- binary_water_qual_gb %>% 
  left_join(san_jose_creek_discharge, by = "date") %>% 
  left_join(noaa_precip_gb, by = "date") %>% 
  left_join(goleta_beach_deposition, by = "date")
Code
# Carpinteria Beach
carpenteria_beach_data <- binary_water_qual_carp %>% 
  left_join(carp_creek_discharge, by = "date") %>% 
  left_join(carp_precip, by = "date") %>% 
  left_join(carp_beach_deposition, by = "date")
Code
# Arroyo Burro Beach (control site)
ab_beach_data <- binary_water_qual_ab %>% 
  left_join(ab_creek_discharge_cfs, by = "date") %>% 
  left_join(noaa_precip_ab_east, by = "date") %>% 
  left_join(ab_beach_deposition, by = "date")

Check Assumptions and Assess Linearity

Goleta Beach

Code
# test for independence of observations
# this assumption is met since data collection was random and independent
Code
# check for multicollinearity using Variance Inflation Factors (VIF)
library(car)
goleta_beach_data$daily_rain <- as.numeric(goleta_beach_data$daily_rain) # because it was a character before

goleta_model <- glm(any_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data, 
                    family = binomial)
vif(goleta_model) # use the vif() function from the car package to check for multicollinearity. VIF values greater than 5 or 10 indicate potential multicollinearity issues
       sj_discharge          daily_rain deposition_occurred 
           1.081608            1.061481            1.031649 
Code
       # sj_discharge          daily_rain deposition_occurred 
       #     1.081608            1.061481            1.031649 

# yay, proceed!! no multicollinearity issues
Code
# assess linearity between continuous predictors and log odds of the outcome

# perform Box-Tidwell test, interaction terms are the continuous variables
goleta_beach_data$sj_discharge_interact <- goleta_beach_data$sj_discharge * log(goleta_beach_data$sj_discharge)
goleta_beach_data$daily_rain_interact <- goleta_beach_data$daily_rain * log(goleta_beach_data$daily_rain + 1) # adding 1 to avoid log(0)

gb_box_tidwell_model <- glm(any_exceed ~ sj_discharge + daily_rain + deposition_occurred + 
                           sj_discharge_interact + daily_rain_interact, 
                         data = goleta_beach_data, 
                         family = binomial)
summary(gb_box_tidwell_model) # deposition_occurred are statistically significant (p < 0.05) but that is okay because it is not linear, it is binary. The interaction terms (sj_discharge_interact and daily_rain_interact) are not statistically significant (p > 0.05).

Call:
glm(formula = any_exceed ~ sj_discharge + daily_rain + deposition_occurred + 
    sj_discharge_interact + daily_rain_interact, family = binomial, 
    data = goleta_beach_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -2.55945    0.10273 -24.914   <2e-16 ***
sj_discharge           0.09211    0.06204   1.485    0.138    
daily_rain             2.32432    1.79781   1.293    0.196    
deposition_occurred    2.25807    0.23231   9.720   <2e-16 ***
sj_discharge_interact -0.00993    0.01611  -0.617    0.538    
daily_rain_interact    0.52492    2.72104   0.193    0.847    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1264.9  on 1736  degrees of freedom
Residual deviance: 1053.0  on 1731  degrees of freedom
  (543 observations deleted due to missingness)
AIC: 1065

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                       Estimate Std. Error z value Pr(>|z|)    
# (Intercept)           -2.55945    0.10273 -24.914   <2e-16 ***
# sj_discharge           0.09211    0.06204   1.485    0.138    
# daily_rain             2.32432    1.79781   1.293    0.196    
# deposition_occurred    2.25807    0.23231   9.720   <2e-16 ***
# sj_discharge_interact -0.00993    0.01611  -0.617    0.538    
# daily_rain_interact    0.52492    2.72104   0.193    0.847    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1264.9  on 1736  degrees of freedom
# Residual deviance: 1053.0  on 1731  degrees of freedom
#   (543 observations deleted due to missingness)
# AIC: 1065
# 
# Number of Fisher Scoring iterations: 5

# the Box-Tidwell test is performed by creating interaction terms between continuous predictors and their natural logs, then including these in the model. Significant interaction terms suggest non-linearity. 
# non-significant interaction terms (p > 0.05) suggest linearity between the continuous predictors and the log odds of the outcome.
Code
# create scatter plots of log odds vs. each predictor
# library(ggplot2)
# 
# # function to calculate log odds, visually assess linearity
# calc_log_odds <- function(x) log(x / (1 - x))
# 
# ggplot(goleta_beach_data, aes(x = sj_discharge, y = calc_log_odds(any_exceed))) +
#   geom_point() +
#   geom_smooth(method = "loess") +
#   labs(title = "Log Odds vs. San Jose Creek Discharge")
# 
# ggplot(goleta_beach_data, aes(x = daily_rain, y = calc_log_odds(any_exceed))) +
#   geom_point() +
#   geom_smooth(method = "loess") +
#   labs(title = "Log Odds vs. Daily Rain")
# 
# # generate partial residual plots
# library(effects) # partial residual plots are generated using the effects package to further assess linearity
# 
# plot(allEffects(goleta_model)) # looks fairly linear??

Carpinteria Beach

Code
# test for independence of observations
# this assumption is met since data collection was random and independent
Code
# check for multicollinearity using Variance Inflation Factors (VIF)
library(car)
carpenteria_beach_data$daily_rain <- as.numeric(carpenteria_beach_data$daily_rain) # because it was a character before

carpenteria_model <- glm(any_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data, 
                    family = binomial)
vif(carpenteria_model) # use the vif() function from the car package to check for multicollinearity. VIF values greater than 5 or 10 indicate potential multicollinearity issues
     carp_discharge          daily_rain deposition_occurred 
           1.063039            1.079736            1.086214 
Code
     # carp_discharge          daily_rain deposition_occurred 
     #       1.141700            1.052755            1.133970 

# yay, proceed!! no multicollinearity issues
Code
# assess linearity between continuous predictors and log odds of the outcome

# perform Box-Tidwell test, interaction terms are the continuous variables
carpenteria_beach_data$carp_discharge_interact <- carpenteria_beach_data$carp_discharge * log(carpenteria_beach_data$carp_discharge)
carpenteria_beach_data$daily_rain_interact <- carpenteria_beach_data$daily_rain * log(carpenteria_beach_data$daily_rain + 1) # adding 1 to avoid log(0)

carp_box_tidwell_model <- glm(any_exceed ~ carp_discharge + daily_rain + deposition_occurred + 
                           carp_discharge_interact + daily_rain_interact, 
                         data = carpenteria_beach_data, 
                         family = binomial)
summary(carp_box_tidwell_model) # daily_rain is statistically significant (p < 0.05). The interaction terms (sj_discharge_interact and daily_rain_interact) are not statistically significant (p > 0.05). 

Call:
glm(formula = any_exceed ~ carp_discharge + daily_rain + deposition_occurred + 
    carp_discharge_interact + daily_rain_interact, family = binomial, 
    data = carpenteria_beach_data)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -1.328965   0.303993  -4.372 1.23e-05 ***
carp_discharge           0.069389   0.045635   1.521   0.1284    
daily_rain               3.995332   1.592725   2.508   0.0121 *  
deposition_occurred      0.358612   0.447016   0.802   0.4224    
carp_discharge_interact -0.013701   0.009309  -1.472   0.1411    
daily_rain_interact     -2.636537   1.886912  -1.397   0.1623    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 240.70  on 175  degrees of freedom
Residual deviance: 210.83  on 170  degrees of freedom
  (1984 observations deleted due to missingness)
AIC: 222.83

Number of Fisher Scoring iterations: 4
Code
# Coefficients:
#                          Estimate Std. Error z value Pr(>|z|)    
# (Intercept)             -1.263639   0.291883  -4.329  1.5e-05 ***
# carp_discharge           0.068925   0.045598   1.512   0.1306    
# daily_rain               3.863127   1.582730   2.441   0.0147 *  
# deposition_occurred      0.107901   0.485392   0.222   0.8241    
# carp_discharge_interact -0.013423   0.009267  -1.448   0.1475    
# daily_rain_interact     -2.554095   1.883818  -1.356   0.1752    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 240.70  on 175  degrees of freedom
# Residual deviance: 211.42  on 170  degrees of freedom
#   (1984 observations deleted due to missingness)
# AIC: 223.42
# 
# Number of Fisher Scoring iterations: 4

# the Box-Tidwell test is performed by creating interaction terms between continuous predictors and their natural logs, then including these in the model. Significant interaction terms suggest non-linearity. 
# non-significant interaction terms (p > 0.05) suggest linearity between the continuous predictors and the log odds of the outcome.
Code
# create scatter plots of log odds vs. each predictor
# library(ggplot2)

# function to calculate log odds, visually assess linearity
# calc_log_odds <- function(x) log(x / (1 - x))
# 
# ggplot(goleta_beach_data, aes(x = sj_discharge, y = calc_log_odds(any_exceed))) +
#   geom_point() +
#   geom_smooth(method = "loess") +
#   labs(title = "Log Odds vs. San Jose Creek Discharge")
# 
# ggplot(goleta_beach_data, aes(x = daily_rain, y = calc_log_odds(any_exceed))) +
#   geom_point() +
#   geom_smooth(method = "loess") +
#   labs(title = "Log Odds vs. Daily Rain")

# generate partial residual plots
# library(effects) # partial residual plots are generated using the effects package to further assess linearity
# 
# plot(allEffects(goleta_model))
# Error in `contrasts<-`(`*tmp*`, value = ca) : 
#   contrasts apply only to factors
# In addition: Warning message:
# In model.frame.default(Terms, predict.data, xlev = factor.levels,  :
#   variable 'daily_rain' is not a factor

Arroyo Burro Beach

Code
# test for independence of observations
# this assumption is met since data collection was random and independent
Code
# check for multicollinearity using Variance Inflation Factors (VIF)
# ab_model <- glm(any_exceed ~ ab_discharge_cfs + daily_rain + deposition_occurred, 
#                     data = ab_beach_data, 
#                     family = binomial) # Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
# vif(ab_model) # error: there are aliased coefficients in the model. This is likely due to the binary variable 'deposition_occurred'.  We can ignore this error and proceed with the model.
# 
# cor(ab_beach_data[c("ab_discharge_cfs", "daily_rain", "deposition_occurred")])
# 
# alias(ab_model)


## not using deposition (since it is all 0s since there has been no deposition there)
ab_model <- glm(any_exceed ~ ab_discharge_cfs + daily_rain, 
                    data = ab_beach_data, 
                    family = binomial)
vif(ab_model) # use the vif() function from the car package to check for multicollinearity. VIF values greater than 5 or 10 indicate potential multicollinearity issues
ab_discharge_cfs       daily_rain 
         1.27372          1.27372 
Code
# ab_discharge_cfs       daily_rain 
#          1.27372          1.27372 

# great! no multicollinearity issues
Code
# assess linearity between continuous predictors and log odds of the outcome

# perform Box-Tidwell test, interaction terms are the continuous variables
ab_beach_data$ab_discharge_interact <- ab_beach_data$ab_discharge_cfs * log(ab_beach_data$ab_discharge_cfs + 1) # adding 1 to avoid log(0)
ab_beach_data$daily_rain_interact <- ab_beach_data$daily_rain * log(ab_beach_data$daily_rain + 1) # adding 1 to avoid log(0)

ab_box_tidwell_model <- glm(any_exceed ~ ab_discharge_cfs + daily_rain + 
                           ab_discharge_interact + daily_rain_interact, 
                         data = ab_beach_data, 
                         family = binomial)
summary(ab_box_tidwell_model)

Call:
glm(formula = any_exceed ~ ab_discharge_cfs + daily_rain + ab_discharge_interact + 
    daily_rain_interact, family = binomial, data = ab_beach_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -2.3788     0.1239 -19.207  < 2e-16 ***
ab_discharge_cfs        0.5060     0.1274   3.972 7.13e-05 ***
daily_rain              1.3363     2.4133   0.554  0.57976    
ab_discharge_interact  -0.1169     0.0389  -3.006  0.00264 ** 
daily_rain_interact     0.8060     4.0304   0.200  0.84150    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1029.19  on 1306  degrees of freedom
Residual deviance:  936.23  on 1302  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 946.23

Number of Fisher Scoring iterations: 5
Code
# Call:
# glm(formula = any_exceed ~ ab_discharge_cfs + daily_rain + ab_discharge_interact + 
#     daily_rain_interact, family = binomial, data = ab_beach_data)
# 
# Coefficients:
#                       Estimate Std. Error z value Pr(>|z|)    
# (Intercept)            -2.3788     0.1239 -19.207  < 2e-16 ***
# ab_discharge_cfs        0.5060     0.1274   3.972 7.13e-05 ***
# daily_rain              1.3363     2.4133   0.554  0.57976    
# ab_discharge_interact  -0.1169     0.0389  -3.006  0.00264 ** 
# daily_rain_interact     0.8060     4.0304   0.200  0.84150    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1029.19  on 1306  degrees of freedom
# Residual deviance:  936.23  on 1302  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 946.23
# 
# Number of Fisher Scoring iterations: 5

# the Box-Tidwell test is performed by creating interaction terms between continuous predictors and their natural logs, then including these in the model. Significant interaction terms suggest non-linearity.

# non-significant interaction terms (p > 0.05) suggest linearity between the continuous predictors and the log odds of the outcome

# the daily_rain and daily_rain_interact are not statistically significant (p > 0.05), suggesting linearity between daily_rain and the log odds of 'any_exceed'.
# however, the ab_discharge_cfs and ab_discharge_interact are statistically significant (p < 0.05), suggesting non-linearity between ab_discharge_cfs and the log odds of 'any_exceed'.



# --------- Transforming the discharge data to see if it can pass the Box-Tidwell test ------------
ab_beach_data$log_discharge <- log(ab_beach_data$ab_discharge_cfs + 1)
ab_beach_data$log_discharge_interact <- ab_beach_data$log_discharge * log(ab_beach_data$log_discharge + 1) # adding 1 to avoid log(0)

ab_box_tidwell_model_log <- glm(any_exceed ~ log_discharge + daily_rain + log_discharge_interact + daily_rain_interact, 
                         data = ab_beach_data, 
                         family = binomial)
summary(ab_box_tidwell_model_log)

Call:
glm(formula = any_exceed ~ log_discharge + daily_rain + log_discharge_interact + 
    daily_rain_interact, family = binomial, data = ab_beach_data)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.4562     0.1856 -13.231   <2e-16 ***
log_discharge            0.7282     0.5292   1.376    0.169    
daily_rain               1.8130     2.3495   0.772    0.440    
log_discharge_interact   0.1228     0.4111   0.299    0.765    
daily_rain_interact      0.3000     3.9451   0.076    0.939    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1029.19  on 1306  degrees of freedom
Residual deviance:  937.11  on 1302  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 947.11

Number of Fisher Scoring iterations: 5
Code
# Call:
# glm(formula = any_exceed ~ log_discharge + daily_rain + log_discharge_interact + 
#     daily_rain_interact, family = binomial, data = ab_beach_data)
# 
# Coefficients:
#                        Estimate Std. Error z value Pr(>|z|)    
# (Intercept)             -2.4562     0.1856 -13.231   <2e-16 ***
# log_discharge            0.7282     0.5292   1.376    0.169    
# daily_rain               1.8130     2.3495   0.772    0.440    
# log_discharge_interact   0.1228     0.4111   0.299    0.765    
# daily_rain_interact      0.3000     3.9451   0.076    0.939    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1029.19  on 1306  degrees of freedom
# Residual deviance:  937.11  on 1302  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 947.11
# 
# Number of Fisher Scoring iterations: 5

# this is great! now none of the interaction terms are statistically significant (p > 0.05), suggesting linearity between the continuous predictors and the log odds of the outcome.

Fit Multivariate Logistic Regression Models

Goleta Beach

For any exceedance of the state health standard: 104 MPN/100mL for enterococcus, 400 MPN/100mL for fecal coliforms, 400 MPN/100mL for ecoli, and 1000 MPN/100mL for total coliforms.

Code
# fit the multivariate logistic regression model
goleta_model <- glm(any_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data, 
                    family = binomial)
summary(goleta_model)

Call:
glm(formula = any_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = goleta_beach_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.61411    0.08552 -30.567  < 2e-16 ***
sj_discharge         0.05822    0.01146   5.081 3.75e-07 ***
daily_rain           2.48553    0.47297   5.255 1.48e-07 ***
deposition_occurred  2.35552    0.22109  10.654  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1512.2  on 2276  degrees of freedom
Residual deviance: 1285.6  on 2273  degrees of freedom
  (3 observations deleted due to missingness)
AIC: 1293.6

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -2.61411    0.08552 -30.567  < 2e-16 ***
# sj_discharge         0.05822    0.01146   5.081 3.75e-07 ***
# daily_rain           2.48553    0.47297   5.255 1.48e-07 ***
# deposition_occurred  2.35552    0.22109  10.654  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1512.2  on 2276  degrees of freedom
# Residual deviance: 1285.6  on 2273  degrees of freedom
#   (3 observations deleted due to missingness)
# AIC: 1293.6
# 
# Number of Fisher Scoring iterations: 5

# The model predicts 'any_exceed' based on 'sj_discharge', 'daily_rain', and 'deposition_occurred'.

# All predictors are statistically significant:
# sj_discharge: p = 3.75e-07 ***.  For each unit increase, the log odds of 'any_exceed' increase by 0.05822.
# 
# daily_rain: p = 1.48e-07 ***.  For each unit increase, the log odds of 'any_exceed' increase by 2.48553.
# 
# deposition_occurred: p < 2e-16 ***.  When deposition occurs, the log odds of 'any_exceed' increase by 2.35552.

# The estimates show the change in log odds of 'any_exceed' for a one-unit increase in each predictor.


# Null deviance: 1512.2 on 2276 degrees of freedom
# Residual deviance: 1285.6 on 2273 degrees of freedom
## the reduction in deviance in the residual suggests the model is better than a null model.
# AIC: 1293.6 (lower values indicate better model fit)

For only enterococcus exceedances:

Code
# fit the multivariate logistic regression model
goleta_model_enterococcus <- glm(enterococcus_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data, 
                    family = binomial)
summary(goleta_model_enterococcus)

Call:
glm(formula = enterococcus_exceed ~ sj_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = goleta_beach_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.73704    0.14137 -26.435  < 2e-16 ***
sj_discharge         0.03635    0.01062   3.422 0.000621 ***
daily_rain           1.52675    0.55583   2.747 0.006018 ** 
deposition_occurred  2.00391    0.28450   7.044 1.87e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 692.93  on 2276  degrees of freedom
Residual deviance: 618.81  on 2273  degrees of freedom
  (3 observations deleted due to missingness)
AIC: 626.81

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -3.73704    0.14137 -26.435  < 2e-16 ***
# sj_discharge         0.03635    0.01062   3.422 0.000621 ***
# daily_rain           1.52675    0.55583   2.747 0.006018 ** 
# deposition_occurred  2.00391    0.28450   7.044 1.87e-12 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 692.93  on 2276  degrees of freedom
# Residual deviance: 618.81  on 2273  degrees of freedom
#   (3 observations deleted due to missingness)
# AIC: 626.81
# 
# Number of Fisher Scoring iterations: 6

It can be interpreted that for each unit increase in San Jose Creek discharge, the log odds of an enterococcus exceedance increase by 0.03635. For each unit increase in daily rain, the log odds of an enterococcus exceedance increase by 1.52675. When deposition occurs, the log odds of an enterococcus exceedance increase by 2.00391.

For only fecal coliform exceedances:

Code
# fit the multivariate logistic regression model
goleta_model_fecal_coliforms <- glm(fecal_coliforms_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data, 
                    family = binomial)
summary(goleta_model_fecal_coliforms)

Call:
glm(formula = fecal_coliforms_exceed ~ sj_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = goleta_beach_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -5.45009    0.32490 -16.775  < 2e-16 ***
sj_discharge         0.01122    0.03080   0.364    0.716    
daily_rain           1.54283    1.05665   1.460    0.144    
deposition_occurred  2.55626    0.53015   4.822 1.42e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.38  on 2276  degrees of freedom
Residual deviance: 176.60  on 2273  degrees of freedom
  (3 observations deleted due to missingness)
AIC: 184.6

Number of Fisher Scoring iterations: 8
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -5.45009    0.32490 -16.775  < 2e-16 ***
# sj_discharge         0.01122    0.03080   0.364    0.716    
# daily_rain           1.54283    1.05665   1.460    0.144    
# deposition_occurred  2.55626    0.53015   4.822 1.42e-06 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 200.38  on 2276  degrees of freedom
# Residual deviance: 176.60  on 2273  degrees of freedom
#   (3 observations deleted due to missingness)
# AIC: 184.6
# 
# Number of Fisher Scoring iterations: 8

This is VERY interesting!! Only deposition_occurred is statistically significant for fecal coliform exceedances. For each unit increase in San Jose Creek discharge, the log odds of a fecal coliform exceedance increase by 0.01122. For each unit increase in daily rain, the log odds of a fecal coliform exceedance increase by 1.54283. When deposition occurs, the log odds of a fecal coliform exceedance increase by 2.55626. That is almost a 5-fold increase in the log odds of a fecal coliform exceedance when deposition occurs (exp(2.55626) = 12.88). This can be interpreted as a 12.88 times increase in the odds of a fecal coliform exceedance when deposition occurs.

For only total coliform exceedances:

Code
# fit the multivariate logistic regression model
goleta_model_total_coliforms <- glm(total_coliforms_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data, 
                    family = binomial)
summary(goleta_model_total_coliforms)

Call:
glm(formula = total_coliforms_exceed ~ sj_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = goleta_beach_data)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.498516   0.126464 -27.664  < 2e-16 ***
sj_discharge         0.041923   0.009964   4.208 2.58e-05 ***
daily_rain           1.292201   0.546873   2.363   0.0181 *  
deposition_occurred  2.075757   0.259926   7.986 1.39e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 826.78  on 2276  degrees of freedom
Residual deviance: 731.99  on 2273  degrees of freedom
  (3 observations deleted due to missingness)
AIC: 739.99

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -3.498516   0.126464 -27.664  < 2e-16 ***
# sj_discharge         0.041923   0.009964   4.208 2.58e-05 ***
# daily_rain           1.292201   0.546873   2.363   0.0181 *  
# deposition_occurred  2.075757   0.259926   7.986 1.39e-15 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 826.78  on 2276  degrees of freedom
# Residual deviance: 731.99  on 2273  degrees of freedom
#   (3 observations deleted due to missingness)
# AIC: 739.99
# 
# Number of Fisher Scoring iterations: 6

These are all significant which makes sense as any_exceed is a reflection of this in a way. The log odds of a total coliform exceedance increase by 0.041923 for each unit increase in San Jose Creek discharge. The log odds of a total coliform exceedance increase by 1.292201 for each unit increase in daily rain. When deposition occurs, the log odds of a total coliform exceedance increase by 2.075757.

Carpinteria Beach

Code
carpenteria_model <- glm(any_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data, 
                    family = binomial)
summary(carpenteria_model)

Call:
glm(formula = any_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = carpenteria_beach_data)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.90930    0.19933  -9.579  < 2e-16 ***
carp_discharge       0.01394    0.00830   1.680   0.0930 .  
daily_rain           2.49736    0.42508   5.875 4.23e-09 ***
deposition_occurred  1.03163    0.44647   2.311   0.0209 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 368.61  on 316  degrees of freedom
Residual deviance: 305.40  on 313  degrees of freedom
  (1843 observations deleted due to missingness)
AIC: 313.4

Number of Fisher Scoring iterations: 4
Code
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -1.90930    0.19933  -9.579  < 2e-16 ***
# carp_discharge       0.01394    0.00830   1.680   0.0930 .  
# daily_rain           2.49736    0.42508   5.875 4.23e-09 ***
# deposition_occurred  1.03163    0.44647   2.311   0.0209 *  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 368.61  on 316  degrees of freedom
# Residual deviance: 305.40  on 313  degrees of freedom
#   (1843 observations deleted due to missingness)
# AIC: 313.4
# 
# Number of Fisher Scoring iterations: 4

# The model predicts 'any_exceed' based on 'carp_discharge', 'daily_rain', and 'deposition_occurred'.

# Almost all predictors are statistically significant:
# carp_discharge: p = 0.0930, if the alpha were 0.1 then it would be significant.  For each unit increase, the log odds of 'any_exceed' increase by 0.01394.
# daily_rain: p = 4.23e-09 ***.  For each unit increase, the log odds of 'any_exceed' increase by 2.49736.
# deposition_occurred: p = 0.0209 *.  When deposition occurs, the log odds of 'any_exceed' increase by 1.03163.

# The estimates show the change in log odds of 'any_exceed' for a one-unit increase in each predictor.

For Carpinteria’s any exceedance model, daily rain and deposition_occurred are statistically significant. For each unit increase in Carpinteria Creek discharge, the log odds of an exceedance increase by 0.01394. For each unit increase in daily rain, the log odds of an exceedance increase by 2.49736. When deposition occurs, the log odds of an exceedance increase by 1.03163. However, Carpinteria Creek discharge is not statistically significant. If the alpha were 0.1, then it would be significant.

For only enterococcus exceedances:

Code
carpenteria_model_enterococcus <- glm(enterococcus_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data, 
                    family = binomial)
summary(carpenteria_model_enterococcus)

Call:
glm(formula = enterococcus_exceed ~ carp_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = carpenteria_beach_data)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.840435   0.280285 -10.134  < 2e-16 ***
carp_discharge       0.005078   0.006431   0.790  0.42977    
daily_rain           1.453794   0.447725   3.247  0.00117 ** 
deposition_occurred  0.483269   0.677161   0.714  0.47543    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 198.53  on 316  degrees of freedom
Residual deviance: 184.18  on 313  degrees of freedom
  (1843 observations deleted due to missingness)
AIC: 192.18

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -2.840435   0.280285 -10.134  < 2e-16 ***
# carp_discharge       0.005078   0.006431   0.790  0.42977    
# daily_rain           1.453794   0.447725   3.247  0.00117 ** 
# deposition_occurred  0.483269   0.677161   0.714  0.47543    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 198.53  on 316  degrees of freedom
# Residual deviance: 184.18  on 313  degrees of freedom
#   (1843 observations deleted due to missingness)
# AIC: 192.18
# 
# Number of Fisher Scoring iterations: 5

# only daily_rain is significant here. For each unit increase in Carpinteria Creek discharge, the log odds of an enterococcus exceedance increase by 0.005078. For each unit increase in daily rain, the log odds of an enterococcus exceedance increase by 1.453794. When deposition occurs, the log odds of an enterococcus exceedance increase by 0.483269.

Only daily rain is statistically significant for enterococcus exceedances. For each unit increase in Carpinteria Creek discharge, the log odds of an enterococcus exceedance increase by 0.005078. For each unit increase in daily rain, the log odds of an enterococcus exceedance increase by 1.453794. When deposition occurs, the log odds of an enterococcus exceedance increase by 0.483269.

For only fecal coliform exceedances:

Code
carpenteria_model_fecal_coliforms <- glm(fecal_coliforms_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data, 
                    family = binomial)
summary(carpenteria_model_fecal_coliforms)

Call:
glm(formula = fecal_coliforms_exceed ~ carp_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = carpenteria_beach_data)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -4.309047   0.516033  -8.350  < 2e-16 ***
carp_discharge       0.008865   0.008975   0.988  0.32327    
daily_rain           2.041463   0.624148   3.271  0.00107 ** 
deposition_occurred -0.217494   1.427782  -0.152  0.87893    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 95.556  on 316  degrees of freedom
Residual deviance: 80.849  on 313  degrees of freedom
  (1843 observations deleted due to missingness)
AIC: 88.849

Number of Fisher Scoring iterations: 6
Code
#                      Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -4.309047   0.516033  -8.350  < 2e-16 ***
# carp_discharge       0.008865   0.008975   0.988  0.32327    
# daily_rain           2.041463   0.624148   3.271  0.00107 ** 
# deposition_occurred -0.217494   1.427782  -0.152  0.87893    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 95.556  on 316  degrees of freedom
# Residual deviance: 80.849  on 313  degrees of freedom
#   (1843 observations deleted due to missingness)
# AIC: 88.849
# 
# Number of Fisher Scoring iterations: 6

Again, only daily rain is statistically significant for fecal coliform exceedances. For each unit increase in Carpinteria Creek discharge, the log odds of a fecal coliform exceedance increase by 0.008865. For each unit increase in daily rain, the log odds of a fecal coliform exceedance increase by 2.041463. When deposition occurs, the log odds of a fecal coliform exceedance decrease by 0.217494.

For only ecoli exceedances:

Code
carpenteria_model_ecoli <- glm(ecoli_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data, 
                    family = binomial)
summary(carpenteria_model_ecoli)

Call:
glm(formula = ecoli_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = carpenteria_beach_data)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -5.665753   1.039347  -5.451    5e-08 ***
carp_discharge      -0.003776   0.015320  -0.246    0.805    
daily_rain           1.780017   1.235258   1.441    0.150    
deposition_occurred  2.131925   1.500636   1.421    0.155    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33.933  on 316  degrees of freedom
Residual deviance: 30.878  on 313  degrees of freedom
  (1843 observations deleted due to missingness)
AIC: 38.878

Number of Fisher Scoring iterations: 8
Code
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -5.665753   1.039347  -5.451    5e-08 ***
# carp_discharge      -0.003776   0.015320  -0.246    0.805    
# daily_rain           1.780017   1.235258   1.441    0.150    
# deposition_occurred  2.131925   1.500636   1.421    0.155    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 33.933  on 316  degrees of freedom
# Residual deviance: 30.878  on 313  degrees of freedom
#   (1843 observations deleted due to missingness)
# AIC: 38.878
# 
# Number of Fisher Scoring iterations: 8

None of the predictors are statistically significant for ecoli exceedances.

For only total coliform exceedances:

Code
carpenteria_model_total_coliforms <- glm(total_coliforms_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data, 
                    family = binomial)
summary(carpenteria_model_total_coliforms)

Call:
glm(formula = total_coliforms_exceed ~ carp_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = carpenteria_beach_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.4308875  0.2405496 -10.106  < 2e-16 ***
carp_discharge       0.0005904  0.0061057   0.097  0.92297    
daily_rain           1.3279157  0.4154074   3.197  0.00139 ** 
deposition_occurred  1.0120347  0.5245308   1.929  0.05368 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 244.17  on 316  degrees of freedom
Residual deviance: 230.32  on 313  degrees of freedom
  (1843 observations deleted due to missingness)
AIC: 238.32

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                       Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -2.4308875  0.2405496 -10.106  < 2e-16 ***
# carp_discharge       0.0005904  0.0061057   0.097  0.92297    
# daily_rain           1.3279157  0.4154074   3.197  0.00139 ** 
# deposition_occurred  1.0120347  0.5245308   1.929  0.05368 .  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 244.17  on 316  degrees of freedom
# Residual deviance: 230.32  on 313  degrees of freedom
#   (1843 observations deleted due to missingness)
# AIC: 238.32
# 
# Number of Fisher Scoring iterations: 5

Using an alpha value of 0.05, only daily rain is statistically significant for total coliform exceedances. However, if an alpha of 0.1 is used, then deposition_occurred is also significant. For each unit increase in Carpinteria Creek discharge, the log odds of a total coliform exceedance increase by 0.0005904. For each unit increase in daily rain, the log odds of a total coliform exceedance increase by 1.3279157. When deposition occurs, the log odds of a total coliform exceedance increase by 1.0120347. This is translated to a 2.75 times increase in the odds of a total coliform exceedance when deposition occurs.

Arroyo Burro

Code
# fit the multivariate logistic regression model
ab_any_exceed_model <- glm(any_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data, 
                    family = binomial)
summary(ab_any_exceed_model)

Call:
glm(formula = any_exceed ~ log_discharge + daily_rain, family = binomial, 
    data = ab_beach_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.4969     0.1269 -19.676  < 2e-16 ***
log_discharge   0.8783     0.1444   6.082 1.19e-09 ***
daily_rain      2.0350     0.7818   2.603  0.00924 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1029.19  on 1306  degrees of freedom
Residual deviance:  937.21  on 1304  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 943.21

Number of Fisher Scoring iterations: 4
Code
# Call:
# glm(formula = any_exceed ~ log_discharge + daily_rain, family = binomial, 
#     data = ab_beach_data)
# 
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -2.4969     0.1269 -19.676  < 2e-16 ***
# log_discharge   0.8783     0.1444   6.082 1.19e-09 ***
# daily_rain      2.0350     0.7818   2.603  0.00924 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1029.19  on 1306  degrees of freedom
# Residual deviance:  937.21  on 1304  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 943.21
# 
# Number of Fisher Scoring iterations: 4

# The model predicts 'any_exceed' based on 'log_discharge' and 'daily_rain'.
# Both predictors are statistically significant:
# log_discharge: p = 1.19e-09 ***.  For each unit increase, the log odds of 'any_exceed' increase by 0.8783.
# daily_rain: p = 0.00924 **.  For each unit increase, the log odds of 'any_exceed' increase by 2.0350.

For Arroyo Burro, both log_discharge and daily_rain are statistically significant in causing any type of exceedance. For each unit increase in log discharge, the log odds of an exceedance increase by 0.8783. For each unit increase in daily rain, the log odds of an exceedance increase by 2.0350.

For only enterococcus exceedances:

Code
# fit the multivariate logistic regression model
ab_enterococcus_model <- glm(enterococcus_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data, 
                    family = binomial)
summary(ab_enterococcus_model)

Call:
glm(formula = enterococcus_exceed ~ log_discharge + daily_rain, 
    family = binomial, data = ab_beach_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.4455     0.1939 -17.768  < 2e-16 ***
log_discharge   0.5868     0.2085   2.815  0.00488 ** 
daily_rain      0.8750     0.7308   1.197  0.23117    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 492.99  on 1306  degrees of freedom
Residual deviance: 473.36  on 1304  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 479.36

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -3.4455     0.1939 -17.768  < 2e-16 ***
# log_discharge   0.5868     0.2085   2.815  0.00488 ** 
# daily_rain      0.8750     0.7308   1.197  0.23117    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 492.99  on 1306  degrees of freedom
# Residual deviance: 473.36  on 1304  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 479.36
# 
# Number of Fisher Scoring iterations: 6

For enterococcus exceedances, only log_discharge is statistically significant. For each unit increase in log discharge, the log odds of an enterococcus exceedance increase by 0.5868. For each unit increase in daily rain, the log odds of an enterococcus exceedance increase by 0.8750.

For only fecal coliform exceedances:

Code
# fit the multivariate logistic regression model
ab_fecal_coliforms_model <- glm(fecal_coliforms_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data, 
                    family = binomial)
summary(ab_fecal_coliforms_model)

Call:
glm(formula = fecal_coliforms_exceed ~ log_discharge + daily_rain, 
    family = binomial, data = ab_beach_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -5.8962     0.5287 -11.152  < 2e-16 ***
log_discharge   1.0782     0.3802   2.836  0.00457 ** 
daily_rain      1.2226     1.0199   1.199  0.23063    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 127.01  on 1306  degrees of freedom
Residual deviance: 104.29  on 1304  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 110.29

Number of Fisher Scoring iterations: 8
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -5.8962     0.5287 -11.152  < 2e-16 ***
# log_discharge   1.0782     0.3802   2.836  0.00457 ** 
# daily_rain      1.2226     1.0199   1.199  0.23063    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 127.01  on 1306  degrees of freedom
# Residual deviance: 104.29  on 1304  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 110.29
# 
# Number of Fisher Scoring iterations: 8

Again, only log_discharge is statistically significant for fecal coliform exceedances. For each unit increase in log discharge, the log odds of a fecal coliform exceedance increase by 1.0782. For each unit increase in daily rain, the log odds of a fecal coliform exceedance increase by 1.2226.

For only total coliform exceedances:

Code
# fit the multivariate logistic regression model
ab_total_coliforms_model <- glm(total_coliforms_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data, 
                    family = binomial)
summary(ab_total_coliforms_model)

Call:
glm(formula = total_coliforms_exceed ~ log_discharge + daily_rain, 
    family = binomial, data = ab_beach_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.20974    0.16831 -19.070  < 2e-16 ***
log_discharge  0.81180    0.17328   4.685  2.8e-06 ***
daily_rain    -0.05611    0.67946  -0.083    0.934    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 628.93  on 1306  degrees of freedom
Residual deviance: 600.32  on 1304  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 606.32

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   -3.20974    0.16831 -19.070  < 2e-16 ***
# log_discharge  0.81180    0.17328   4.685  2.8e-06 ***
# daily_rain    -0.05611    0.67946  -0.083    0.934    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 628.93  on 1306  degrees of freedom
# Residual deviance: 600.32  on 1304  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 606.32
# 
# Number of Fisher Scoring iterations: 5

For total coliform exceedances, only log_discharge is statistically significant. For each unit increase in log discharge, the log odds of a total coliform exceedance increase by 0.81180. For each unit increase in daily rain, the log odds of a total coliform exceedance decrease by 0.05611.

For only ecoli exceedances:

Code
# fit the multivariate logistic regression model
ab_ecoli_model <- glm(ecoli_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data, 
                    family = binomial)
summary(ab_ecoli_model)

Call:
glm(formula = ecoli_exceed ~ log_discharge + daily_rain, family = binomial, 
    data = ab_beach_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -4.7312     0.3493 -13.546   <2e-16 ***
log_discharge   0.6405     0.3484   1.839    0.066 .  
daily_rain      0.4237     1.1795   0.359    0.719    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.02  on 1306  degrees of freedom
Residual deviance: 184.15  on 1304  degrees of freedom
  (2438 observations deleted due to missingness)
AIC: 190.15

Number of Fisher Scoring iterations: 7
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -4.7312     0.3493 -13.546   <2e-16 ***
# log_discharge   0.6405     0.3484   1.839    0.066 .  
# daily_rain      0.4237     1.1795   0.359    0.719    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 190.02  on 1306  degrees of freedom
# Residual deviance: 184.15  on 1304  degrees of freedom
#   (2438 observations deleted due to missingness)
# AIC: 190.15
# 
# Number of Fisher Scoring iterations: 7

None of the predictors are statistically significant for ecoli exceedances, using an alpha value of 0.05.

Model Diagnostics

Dry Multivariate Logistic Regression Analysis

These models consider only days that have had less than 0.04 inches of rain, as according to USGS’s US Rain Days product https://earlywarning.usgs.gov/usraindry/rdreadme.php. This is to see if the discharge from the creeks is still a significant predictor of exceedances when there is no rain. This is important because the creeks are not receiving additional water from rain, so we can better understand the causality of deposition events on exceedances.

Goleta Beach

Code
# filter the data to only include days with less than 0.1 inches of rain
goleta_beach_data_dry <- goleta_beach_data %>%
  filter(daily_rain < 0.04)

# accounting for lag, filtering to only observations of dry days that have been dry for the past 72 hours (aka three days)
goleta_beach_data_dry_72hrs <- goleta_beach_data_dry %>%
  mutate(daily_rain_lag1 = lag(daily_rain, 1), # using dplyr's lag function to create lagged variables
         daily_rain_lag2 = lag(daily_rain, 2),
         daily_rain_lag3 = lag(daily_rain, 3)) %>%
  filter(daily_rain_lag1 < 0.04) %>%
  filter(daily_rain_lag2 < 0.04) %>%
  filter(daily_rain_lag3 < 0.04)

Testing assumptions for a multiple logistic regression model:

Code
# check for correlation between predictors
gb_dry_cor <- cor(goleta_beach_data_dry_72hrs[, c("sj_discharge", "daily_rain", "deposition_occurred")])
gb_dry_cor
                    sj_discharge   daily_rain deposition_occurred
sj_discharge         1.000000000 -0.007247157          0.29688570
daily_rain          -0.007247157  1.000000000         -0.01487237
deposition_occurred  0.296885696 -0.014872369          1.00000000
Code
#                     sj_discharge   daily_rain deposition_occurred
# sj_discharge         1.000000000 -0.007247157          0.29688570
# daily_rain          -0.007247157  1.000000000         -0.01487237
# deposition_occurred  0.296885696 -0.014872369          1.00000000

No correlation between any of the predictors.

Code
# check for multicollinearity
goleta_any_exceed_dry_model <- glm(any_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
vif(goleta_any_exceed_dry_model)
       sj_discharge          daily_rain deposition_occurred 
           1.058205            1.001141            1.058404 
Code
   # sj_discharge          daily_rain deposition_occurred 
   #         1.058205            1.001141            1.058404 

Great, no multicollinearity between the predictors, since all VIF values are less than 5.

Code
# assess linearity between continuous predictors and log odds of the outcome

# perform Box-Tidwell test, interaction terms are the continuous variables
gb_bt_interact_dry_model <- glm(any_exceed ~ sj_discharge + sj_discharge_interact + daily_rain + daily_rain_interact + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
summary(gb_bt_interact_dry_model)

Call:
glm(formula = any_exceed ~ sj_discharge + sj_discharge_interact + 
    daily_rain + daily_rain_interact + deposition_occurred, family = binomial, 
    data = goleta_beach_data_dry_72hrs)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -2.614e+00  1.123e-01 -23.276   <2e-16 ***
sj_discharge           9.878e-02  9.413e-02   1.049    0.294    
sj_discharge_interact -1.254e-02  2.795e-02  -0.449    0.654    
daily_rain             6.770e+01  8.448e+01   0.801    0.423    
daily_rain_interact   -2.228e+03  3.290e+03  -0.677    0.498    
deposition_occurred    2.414e+00  2.421e-01   9.972   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1061.39  on 1616  degrees of freedom
Residual deviance:  926.75  on 1611  degrees of freedom
  (531 observations deleted due to missingness)
AIC: 938.75

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                         Estimate Std. Error z value Pr(>|z|)    
# (Intercept)           -2.614e+00  1.123e-01 -23.276   <2e-16 ***
# sj_discharge           9.878e-02  9.413e-02   1.049    0.294    
# sj_discharge_interact -1.254e-02  2.795e-02  -0.449    0.654    
# daily_rain             6.770e+01  8.448e+01   0.801    0.423    
# daily_rain_interact   -2.228e+03  3.290e+03  -0.677    0.498    
# deposition_occurred    2.414e+00  2.421e-01   9.972   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1061.39  on 1616  degrees of freedom
# Residual deviance:  926.75  on 1611  degrees of freedom
#   (531 observations deleted due to missingness)
# AIC: 938.75
# 
# Number of Fisher Scoring iterations: 5

Only deposition is statistically significant. None of the continuous variables or their interaction variables within the Box-Tidwell test are significant, so the assumption of linearity is met.

Multivariate logistic regression modeling

For any exceedances:

Code
# fit the multivariate logistic regression model
goleta_any_exceed_dry_model <- glm(any_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
summary(goleta_any_exceed_dry_model)

Call:
glm(formula = any_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = goleta_beach_data_dry_72hrs)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.64836    0.09009 -29.398  < 2e-16 ***
sj_discharge         0.05965    0.01725   3.459 0.000543 ***
daily_rain           2.91663   22.90500   0.127 0.898674    
deposition_occurred  2.49558    0.23295  10.713  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1298.2  on 2147  degrees of freedom
Residual deviance: 1154.3  on 2144  degrees of freedom
AIC: 1162.3

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -2.64836    0.09009 -29.398  < 2e-16 ***
# sj_discharge         0.05965    0.01725   3.459 0.000543 ***
# daily_rain           2.91663   22.90500   0.127 0.898674    
# deposition_occurred  2.49558    0.23295  10.713  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 1298.2  on 2147  degrees of freedom
# Residual deviance: 1154.3  on 2144  degrees of freedom
# AIC: 1162.3
# 
# Number of Fisher Scoring iterations: 5

This is exactly as I expected, in that the rain should not be a significant predictor of exceedances when there is no rain. Both deposition occurrence and San Jose Creek discharge are statistically significant.

The log odds of an exceedance increase by 0.05965 for each unit increase in San Jose Creek discharge. The log odds of an exceedance increase by 2.91663 for each unit increase in daily rain. When deposition occurs, the log odds of an exceedance increase by 2.49558.

For only enterococcus exceedances:

Code
# fit the multivariate logistic regression model
goleta_enterococcus_dry_model <- glm(enterococcus_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
summary(goleta_enterococcus_dry_model)

Call:
glm(formula = enterococcus_exceed ~ sj_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = goleta_beach_data_dry_72hrs)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.82996    0.15424 -24.832  < 2e-16 ***
sj_discharge         0.04252    0.02200   1.933   0.0532 .  
daily_rain          -1.23225   40.80533  -0.030   0.9759    
deposition_occurred  2.25276    0.31813   7.081 1.43e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 582.74  on 2147  degrees of freedom
Residual deviance: 527.85  on 2144  degrees of freedom
AIC: 535.85

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -3.82996    0.15424 -24.832  < 2e-16 ***
# sj_discharge         0.04252    0.02200   1.933   0.0532 .  
# daily_rain          -1.23225   40.80533  -0.030   0.9759    
# deposition_occurred  2.25276    0.31813   7.081 1.43e-12 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 582.74  on 2147  degrees of freedom
# Residual deviance: 527.85  on 2144  degrees of freedom
# AIC: 535.85
# 
# Number of Fisher Scoring iterations: 6

Maintaining an alpha of 0.05, only deposition_occurred is statistically significant for enterococcus exceedances. For each unit increase in San Jose Creek discharge, the log odds of an enterococcus exceedance increase by 0.04252. For each unit increase in daily rain, the log odds of an enterococcus exceedance decrease by 1.23225. When deposition occurs, the log odds of an enterococcus exceedance increase by 2.25276.

This is interesting because in the wet model, discharge was significant for enterococcus exceedances. This could be due to the fact that the creek discharge is not as high when there is no rain, so it is not as significant of a predictor. However, deposition is still significant.

For only fecal coliform exceedances:

Code
# fit the multivariate logistic regression model
goleta_fecal_coliforms_dry_model <- glm(fecal_coliforms_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
summary(goleta_fecal_coliforms_dry_model)

Call:
glm(formula = fecal_coliforms_exceed ~ sj_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = goleta_beach_data_dry_72hrs)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -4.765e+00  4.171e-01 -11.425  < 2e-16 ***
sj_discharge        -1.947e+01  1.178e+01  -1.654   0.0982 .  
daily_rain          -1.427e+03  1.195e+05  -0.012   0.9905    
deposition_occurred  3.931e+00  6.478e-01   6.069 1.29e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 168.84  on 2147  degrees of freedom
Residual deviance: 122.98  on 2144  degrees of freedom
AIC: 130.98

Number of Fisher Scoring iterations: 19
Code
# Coefficients:
#                       Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -4.765e+00  4.171e-01 -11.425  < 2e-16 ***
# sj_discharge        -1.947e+01  1.178e+01  -1.654   0.0982 .  
# daily_rain          -1.427e+03  1.195e+05  -0.012   0.9905    
# deposition_occurred  3.931e+00  6.478e-01   6.069 1.29e-09 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 168.84  on 2147  degrees of freedom
# Residual deviance: 122.98  on 2144  degrees of freedom
# AIC: 130.98
# 
# Number of Fisher Scoring iterations: 19

For fecal coliform exceedances, only deposition_occurred is statistically significant. For each unit increase in San Jose Creek discharge, the log odds of a fecal coliform exceedance decrease by 19.47. For each unit increase in daily rain, the log odds of a fecal coliform exceedance decrease by 1427. When deposition occurs, the log odds of a fecal coliform exceedance increase by 3.931. This can be interpreted as a 51.1 times increase in the odds of a fecal coliform exceedance when deposition occurs, since exp(3.931) = 50.95791.

These are the same results as the wet model, except the coefficient for San Jose Creek discharge and rain here are negative, and the log odds are increased in the dry model. In the wet model, for each unit increase in San Jose Creek discharge, the log odds of a fecal coliform exceedance increase by 0.01122, for each unit increase in daily rain, the log odds of a fecal coliform exceedance increase by 1.54283, and when deposition occurs, the log odds of a fecal coliform exceedance increase by 2.55626.

For only ecoli exceedances (missing 2018 data):

Code
# fit the multivariate logistic regression model
goleta_ecoli_dry_model <- glm(ecoli_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
summary(goleta_ecoli_dry_model)

Call:
glm(formula = ecoli_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = goleta_beach_data_dry_72hrs)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -4.0535     0.1955 -20.735   <2e-16 ***
sj_discharge          -0.3319     0.3242  -1.024    0.306    
daily_rain            22.1899    37.3029   0.595    0.552    
deposition_occurred  -14.8761  1023.9761  -0.015    0.988    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 332.74  on 2147  degrees of freedom
Residual deviance: 327.40  on 2144  degrees of freedom
AIC: 335.4

Number of Fisher Scoring iterations: 18
Code
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)    
# (Intercept)           -4.0535     0.1955 -20.735   <2e-16 ***
# sj_discharge          -0.3319     0.3242  -1.024    0.306    
# daily_rain            22.1899    37.3029   0.595    0.552    
# deposition_occurred  -14.8761  1023.9761  -0.015    0.988    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 332.74  on 2147  degrees of freedom
# Residual deviance: 327.40  on 2144  degrees of freedom
# AIC: 335.4
# 
# Number of Fisher Scoring iterations: 18

None of the predictors are statistically significant for ecoli exceedances.

For only total coliform exceedances:

Code
# fit the multivariate logistic regression model
goleta_total_coliforms_dry_model <- glm(total_coliforms_exceed ~ sj_discharge + daily_rain + deposition_occurred, 
                    data = goleta_beach_data_dry_72hrs, 
                    family = binomial)
summary(goleta_total_coliforms_dry_model)

Call:
glm(formula = total_coliforms_exceed ~ sj_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = goleta_beach_data_dry_72hrs)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.60412    0.13796 -26.124  < 2e-16 ***
sj_discharge         0.06937    0.01837   3.777 0.000159 ***
daily_rain           2.30749   35.15259   0.066 0.947663    
deposition_occurred  2.10372    0.29228   7.198 6.13e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 696.38  on 2147  degrees of freedom
Residual deviance: 624.27  on 2144  degrees of freedom
AIC: 632.27

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -3.60412    0.13796 -26.124  < 2e-16 ***
# sj_discharge         0.06937    0.01837   3.777 0.000159 ***
# daily_rain           2.30749   35.15259   0.066 0.947663    
# deposition_occurred  2.10372    0.29228   7.198 6.13e-13 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 696.38  on 2147  degrees of freedom
# Residual deviance: 624.27  on 2144  degrees of freedom
# AIC: 632.27
# 
# Number of Fisher Scoring iterations: 6

For total coliform exceedances, discharge and deposition occurrence are statistically significant. For each unit increase in San Jose Creek discharge, the log odds of a total coliform exceedance increase by 0.06937. For each unit increase in daily rain, the log odds of a total coliform exceedance increase by 2.30749. When deposition occurs, the log odds of a total coliform exceedance increase by 2.10372.

This is a similar result to the wet model, except that daily rain is not significant in the dry model. This is a good gut check, as rain should not be a significant predictor of total coliform exceedances when there is no rain.

Carpinteria Beach

Code
# filter the data to only include days with less than 0.1 inches of rain
carpenteria_beach_data_dry <- carpenteria_beach_data %>%
  filter(daily_rain < 0.04)

# accounting for lag, filtering to only observations of dry days that have been dry for the past 72 hours (aka three days)
carpenteria_beach_data_dry_72hrs <- carpenteria_beach_data_dry %>%
  mutate(daily_rain_lag1 = lag(daily_rain, 1), # using dplyr's lag function to create lagged variables
         daily_rain_lag2 = lag(daily_rain, 2),
         daily_rain_lag3 = lag(daily_rain, 3)) %>%
  filter(daily_rain_lag1 < 0.04) %>%
  filter(daily_rain_lag2 < 0.04) %>%
  filter(daily_rain_lag3 < 0.04)

Testing assumptions for a multiple logistic regression model:

Code
# check for correlation between predictors
cb_dry_cor <- cor(carpenteria_beach_data_dry_72hrs[, c("carp_discharge", "daily_rain", "deposition_occurred")])
cb_dry_cor
                    carp_discharge daily_rain deposition_occurred
carp_discharge          1.00000000  0.3834520          0.06543696
daily_rain              0.38345201  1.0000000         -0.15307819
deposition_occurred     0.06543696 -0.1530782          1.00000000
Code
#                     carp_discharge daily_rain deposition_occurred
# carp_discharge          1.00000000  0.3834520          0.06543696
# daily_rain              0.38345201  1.0000000         -0.15307819
# deposition_occurred     0.06543696 -0.1530782          1.00000000

No correlation is detected between predictors.

Code
# check for multicollinearity
carpenteria_any_exceed_dry_model <- glm(any_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
vif(carpenteria_any_exceed_dry_model)
     carp_discharge          daily_rain deposition_occurred 
           1.315074            1.457872            1.126186 
Code
    # carp_discharge          daily_rain deposition_occurred 
    #        1.315074            1.457872            1.126186 

Great, all are below 5, so no multicollinearity.

Code
# assess linearity between continuous predictors and log odds of the outcome

# perform Box-Tidwell test, interaction terms are the continuous variables
cb_bt_interact_dry_model <- glm(any_exceed ~ carp_discharge + carp_discharge_interact + daily_rain + daily_rain_interact + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
summary(cb_bt_interact_dry_model)

Call:
glm(formula = any_exceed ~ carp_discharge + carp_discharge_interact + 
    daily_rain + daily_rain_interact + deposition_occurred, family = binomial, 
    data = carpenteria_beach_data_dry_72hrs)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)             -3.240e+00  2.092e+00  -1.549    0.121
carp_discharge          -1.427e-01  2.206e-01  -0.647    0.518
carp_discharge_interact  3.870e-02  5.851e-02   0.661    0.508
daily_rain               1.744e+02  2.792e+02   0.625    0.532
daily_rain_interact     -2.716e+03  7.609e+03  -0.357    0.721
deposition_occurred      9.466e-01  7.882e-01   1.201    0.230

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 80.793  on 75  degrees of freedom
Residual deviance: 72.798  on 70  degrees of freedom
  (69 observations deleted due to missingness)
AIC: 84.798

Number of Fisher Scoring iterations: 4
Code
# Coefficients:
#                           Estimate Std. Error z value Pr(>|z|)
# (Intercept)             -3.240e+00  2.092e+00  -1.549    0.121
# carp_discharge          -1.427e-01  2.206e-01  -0.647    0.518
# carp_discharge_interact  3.870e-02  5.851e-02   0.661    0.508
# daily_rain               1.744e+02  2.792e+02   0.625    0.532
# daily_rain_interact     -2.716e+03  7.609e+03  -0.357    0.721
# deposition_occurred      9.466e-01  7.882e-01   1.201    0.230
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 80.793  on 75  degrees of freedom
# Residual deviance: 72.798  on 70  degrees of freedom
#   (69 observations deleted due to missingness)
# AIC: 84.798
# 
# Number of Fisher Scoring iterations: 4

Wow, none of the predictors are statistically significant! This is a good sign that the assumption of linearity is met.

Multivariate logistic regression modeling

For any exceedances:

Code
# fit the multivariate logistic regression model
carpenteria_any_exceed_dry_model <- glm(any_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
summary(carpenteria_any_exceed_dry_model)

Call:
glm(formula = any_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = carpenteria_beach_data_dry_72hrs)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.92464    0.62763  -4.660 3.16e-06 ***
carp_discharge       0.04447    0.01732   2.568   0.0102 *  
daily_rain          22.36139   29.57021   0.756   0.4495    
deposition_occurred  1.29655    0.70399   1.842   0.0655 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 108.776  on 144  degrees of freedom
Residual deviance:  93.761  on 141  degrees of freedom
AIC: 101.76

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -2.92464    0.62763  -4.660 3.16e-06 ***
# carp_discharge       0.04447    0.01732   2.568   0.0102 *  
# daily_rain          22.36139   29.57021   0.756   0.4495    
# deposition_occurred  1.29655    0.70399   1.842   0.0655 .  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 108.776  on 144  degrees of freedom
# Residual deviance:  93.761  on 141  degrees of freedom
# AIC: 101.76
# 
# Number of Fisher Scoring iterations: 5

It appears that only Carpinteria Creek discharge is statistically significant for any exceedances. If the alpha value is increased to 0.1, then deposition_occurred is also significant.

For each unit increase in Carpinteria Creek discharge, the log odds of an exceedance increase by 0.04447. For each unit increase in daily rain, the log odds of an exceedance increase by 22.36139. When deposition occurs, the log odds of an exceedance increase by 1.29655.

In the wet model, daily rain was significant, but in the dry model, it is not. This is a good sign that the model is working as expected. However, in the wet model the deposition was significant, but in the dry model it is not.

For only enterococcus exceedances:

Code
# fit the multivariate logistic regression model
carpenteria_enterococcus_dry_model <- glm(enterococcus_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
summary(carpenteria_enterococcus_dry_model)

Call:
glm(formula = enterococcus_exceed ~ carp_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = carpenteria_beach_data_dry_72hrs)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)   
(Intercept)          -3.13542    0.95854  -3.271  0.00107 **
carp_discharge        0.05017    0.02762   1.816  0.06935 . 
daily_rain          -22.16000   58.69514  -0.378  0.70577   
deposition_occurred   0.34160    1.17971   0.290  0.77215   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 56.088  on 144  degrees of freedom
Residual deviance: 50.883  on 141  degrees of freedom
AIC: 58.883

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)   
# (Intercept)          -3.13542    0.95854  -3.271  0.00107 **
# carp_discharge        0.05017    0.02762   1.816  0.06935 . 
# daily_rain          -22.16000   58.69514  -0.378  0.70577   
# deposition_occurred   0.34160    1.17971   0.290  0.77215   
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 56.088  on 144  degrees of freedom
# Residual deviance: 50.883  on 141  degrees of freedom
# AIC: 58.883
# 
# Number of Fisher Scoring iterations: 6

None of the predictors are statistically significant for enterococcus exceedances. This makes sense, because in the wet model only daily rain was a significant predictor for enterococcus exceedances.

For only fecal coliform exceedances:

Code
# fit the multivariate logistic regression model
carpenteria_fecal_coliforms_dry_model <- glm(fecal_coliforms_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
summary(carpenteria_fecal_coliforms_dry_model)

Call:
glm(formula = fecal_coliforms_exceed ~ carp_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = carpenteria_beach_data_dry_72hrs)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -2.657e+01  5.986e+04       0        1
carp_discharge      -6.584e-17  2.494e+03       0        1
daily_rain          -3.913e-13  3.183e+06       0        1
deposition_occurred -5.364e-15  9.919e+04       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.0000e+00  on 144  degrees of freedom
Residual deviance: 8.4123e-10  on 141  degrees of freedom
AIC: 8

Number of Fisher Scoring iterations: 25
Code
## WARNING!!!! There are only 14 observations, which I suspect is why the model is not converging. 

There are only 14 observations for fecal coliform exceedances, which is why the model is not converging. This is not enough data to make any conclusions.

For only ecoli exceedances:

Code
# fit the multivariate logistic regression model
carpenteria_ecoli_dry_model <- glm(ecoli_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
summary(carpenteria_ecoli_dry_model)

Call:
glm(formula = ecoli_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
    family = binomial, data = carpenteria_beach_data_dry_72hrs)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -2.657e+01  5.986e+04       0        1
carp_discharge      -6.584e-17  2.494e+03       0        1
daily_rain          -3.913e-13  3.183e+06       0        1
deposition_occurred -5.364e-15  9.919e+04       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.0000e+00  on 144  degrees of freedom
Residual deviance: 8.4123e-10  on 141  degrees of freedom
AIC: 8

Number of Fisher Scoring iterations: 25
Code
# there are 34 observations, but the model did not converge

The model did not converge, likely because there are only 34 observations. This is not enough data to make any conclusions.

For only total coliform exceedances:

Code
# fit the multivariate logistic regression model
carpenteria_total_coliforms_dry_model <- glm(total_coliforms_exceed ~ carp_discharge + daily_rain + deposition_occurred, 
                    data = carpenteria_beach_data_dry_72hrs, 
                    family = binomial)
summary(carpenteria_total_coliforms_dry_model)

Call:
glm(formula = total_coliforms_exceed ~ carp_discharge + daily_rain + 
    deposition_occurred, family = binomial, data = carpenteria_beach_data_dry_72hrs)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.83501    0.82708  -4.637 3.54e-06 ***
carp_discharge       0.02791    0.01801   1.550   0.1212    
daily_rain          41.60966   34.51389   1.206   0.2280    
deposition_occurred  1.72673    0.83178   2.076   0.0379 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 77.878  on 144  degrees of freedom
Residual deviance: 68.470  on 141  degrees of freedom
AIC: 76.47

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)    
# (Intercept)         -3.83501    0.82708  -4.637 3.54e-06 ***
# carp_discharge       0.02791    0.01801   1.550   0.1212    
# daily_rain          41.60966   34.51389   1.206   0.2280    
# deposition_occurred  1.72673    0.83178   2.076   0.0379 *  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 77.878  on 144  degrees of freedom
# Residual deviance: 68.470  on 141  degrees of freedom
# AIC: 76.47
# 
# Number of Fisher Scoring iterations: 6

Oooo, interesting! Deposition occurrence is the only statistically significant predictor for total coliform exceedances. For each unit increase in Carpinteria Creek discharge, the log odds of a total coliform exceedance increase by 0.02791. For each unit increase in daily rain, the log odds of a total coliform exceedance increase by 41.60966. When deposition occurs, the log odds of a total coliform exceedance increase by 1.72673.

This is cool because in the wet model, only daily rain was significant at an alpha value of 0.05, though deposition would have been signficiant with an alpha of 0.1. In the dry model, daily rain is not significant, but deposition is.

I am wary of the very high coefficient for daily rain, but I think this is due to the small sample size (48) … I will need to look into this further.

Arroyo Burro Beach

Code
# filter the data to only include days with less than 0.1 inches of rain
ab_beach_data_dry <- ab_beach_data %>%
  filter(daily_rain < 0.04)

# accounting for lag, filtering to only observations of dry days that have been dry for the past 72 hours (aka three days)
ab_beach_data_dry_72hrs <- ab_beach_data_dry %>%
  mutate(daily_rain_lag1 = lag(daily_rain, 1), # using dplyr's lag function to create lagged variables
         daily_rain_lag2 = lag(daily_rain, 2),
         daily_rain_lag3 = lag(daily_rain, 3)) %>%
  filter(daily_rain_lag1 < 0.04) %>%
  filter(daily_rain_lag2 < 0.04) %>%
  filter(daily_rain_lag3 < 0.04)

Testing assumptions for a multiple logistic regression model:

Code
ab_beach_data_dry_72hrs$ab_discharge_cfs <- as.numeric(ab_beach_data_dry_72hrs$ab_discharge_cfs)

# check for multicollinearity
arroyo_burro_any_exceed_dry_model <- glm(any_exceed ~ ab_discharge_cfs + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
arroyo_burro_any_exceed_dry_model

Call:  glm(formula = any_exceed ~ ab_discharge_cfs + daily_rain, family = binomial, 
    data = ab_beach_data_dry_72hrs)

Coefficients:
     (Intercept)  ab_discharge_cfs        daily_rain  
        -2.11934           0.07004          25.58958  

Degrees of Freedom: 1240 Total (i.e. Null);  1238 Residual
  (959 observations deleted due to missingness)
Null Deviance:      886.8 
Residual Deviance: 881.5    AIC: 887.5
Code
# Coefficients:
#      (Intercept)  ab_discharge_cfs        daily_rain  
#         -2.11934           0.07004          25.58958
# HIGH multicolinearity....


ab_any_exceed_log_dry_model <- glm(any_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
vif(ab_any_exceed_log_dry_model)
log_discharge    daily_rain 
     1.002393      1.002393 
Code
# log_discharge    daily_rain 
#      1.002393      1.002393 
# no issue once the discharge is transformed!

No multicollinearity between the predictors once the discharge is transformed.

Code
# assess linearity between continuous predictors and log odds of the outcome

# perform Box-Tidwell test, interaction terms are the continuous variables
ab_bt_interact_dry_model <- glm(any_exceed ~ log_discharge + log_discharge_interact + daily_rain + daily_rain_interact, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
summary(ab_bt_interact_dry_model)

Call:
glm(formula = any_exceed ~ log_discharge + log_discharge_interact + 
    daily_rain + daily_rain_interact, family = binomial, data = ab_beach_data_dry_72hrs)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -2.6189     0.2022 -12.950   <2e-16 ***
log_discharge              1.6354     0.6373   2.566   0.0103 *  
log_discharge_interact    -0.9001     0.5518  -1.631   0.1029    
daily_rain                54.4597    84.6698   0.643   0.5201    
daily_rain_interact    -1750.7777  3815.1913  -0.459   0.6463    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 886.85  on 1240  degrees of freedom
Residual deviance: 871.06  on 1236  degrees of freedom
  (959 observations deleted due to missingness)
AIC: 881.06

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#                          Estimate Std. Error z value Pr(>|z|)    
# (Intercept)               -2.6189     0.2022 -12.950   <2e-16 ***
# log_discharge              1.6354     0.6373   2.566   0.0103 *  
# log_discharge_interact    -0.9001     0.5518  -1.631   0.1029    
# daily_rain                54.4597    84.6698   0.643   0.5201    
# daily_rain_interact    -1750.7777  3815.1913  -0.459   0.6463    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 886.85  on 1240  degrees of freedom
# Residual deviance: 871.06  on 1236  degrees of freedom
#   (959 observations deleted due to missingness)
# AIC: 881.06
# 
# Number of Fisher Scoring iterations: 5

So, it appears that log_discharge is statistically significant, but daily rain is not. The interaction term for log_discharge (log_discharge_interact) is not significant (p = 0.1029), suggesting the relationship between log-transformed discharge and the log odds of exceedance is approximately linear.

Multivariate logistic regression modeling

For any exceedances:

Code
# fit the multivariate logistic regression model
ab_any_exceed_dry_model <- glm(any_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
summary(ab_any_exceed_dry_model)

Call:
glm(formula = any_exceed ~ log_discharge + daily_rain, family = binomial, 
    data = ab_beach_data_dry_72hrs)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.3645     0.1316 -17.966   <2e-16 ***
log_discharge   0.6030     0.1668   3.615   0.0003 ***
daily_rain     20.3462    29.5277   0.689   0.4908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 886.85  on 1240  degrees of freedom
Residual deviance: 874.21  on 1238  degrees of freedom
  (959 observations deleted due to missingness)
AIC: 880.21

Number of Fisher Scoring iterations: 4
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -2.3645     0.1316 -17.966   <2e-16 ***
# log_discharge   0.6030     0.1668   3.615   0.0003 ***
# daily_rain     20.3462    29.5277   0.689   0.4908    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 886.85  on 1240  degrees of freedom
# Residual deviance: 874.21  on 1238  degrees of freedom
#   (959 observations deleted due to missingness)
# AIC: 880.21
# 
# Number of Fisher Scoring iterations: 4

Only log_discharge is statistically significant for any exceedances. For each unit increase in log-transformed discharge, the log odds of an exceedance increase by 0.6030. For each unit increase in daily rain, the log odds of an exceedance increase by 20.3462.

This is interesting because in the wet model, both discharge and rain were significant predictors of any exceedances. In the dry model, only discharge is significant.

For only enterococcus exceedances:

Code
# fit the multivariate logistic regression model
ab_enterococcus_dry_model <- glm(enterococcus_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
summary(ab_enterococcus_dry_model)

Call:
glm(formula = enterococcus_exceed ~ log_discharge + daily_rain, 
    family = binomial, data = ab_beach_data_dry_72hrs)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.3185     0.2101 -15.792   <2e-16 ***
log_discharge   0.2704     0.2854   0.947    0.343    
daily_rain     24.8899    44.9868   0.553    0.580    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 419.12  on 1240  degrees of freedom
Residual deviance: 417.97  on 1238  degrees of freedom
  (959 observations deleted due to missingness)
AIC: 423.97

Number of Fisher Scoring iterations: 6
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -3.3185     0.2101 -15.792   <2e-16 ***
# log_discharge   0.2704     0.2854   0.947    0.343    
# daily_rain     24.8899    44.9868   0.553    0.580    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 419.12  on 1240  degrees of freedom
# Residual deviance: 417.97  on 1238  degrees of freedom
#   (959 observations deleted due to missingness)
# AIC: 423.97
# 
# Number of Fisher Scoring iterations: 6

None of the predictors are statistically significant for enterococcus exceedances. This is interesting because in the wet model, discharge was a significant predictor of enterococcus exceedances, but in the dry model it is not. Perhaps rain is a confounder in the wet model.

For only fecal coliform exceedances:

Code
# fit the multivariate logistic regression model
ab_fecal_coliforms_dry_model <- glm(fecal_coliforms_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
summary(ab_fecal_coliforms_dry_model)

Call:
glm(formula = fecal_coliforms_exceed ~ log_discharge + daily_rain, 
    family = binomial, data = ab_beach_data_dry_72hrs)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -5.207e+00  5.989e-01  -8.694   <2e-16 ***
log_discharge -1.978e-01  9.957e-01  -0.199    0.843    
daily_rain    -1.354e+03  1.417e+05  -0.010    0.992    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 75.954  on 1240  degrees of freedom
Residual deviance: 75.558  on 1238  degrees of freedom
  (959 observations deleted due to missingness)
AIC: 81.558

Number of Fisher Scoring iterations: 18
Code
# Coefficients:
#                 Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   -5.207e+00  5.989e-01  -8.694   <2e-16 ***
# log_discharge -1.978e-01  9.957e-01  -0.199    0.843    
# daily_rain    -1.354e+03  1.417e+05  -0.010    0.992    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 75.954  on 1240  degrees of freedom
# Residual deviance: 75.558  on 1238  degrees of freedom
#   (959 observations deleted due to missingness)
# AIC: 81.558
# 
# Number of Fisher Scoring iterations: 18

Nothing is statistically significant for fecal coliform exceedances. In the wet model, only discharge was significant.

For only ecoli exceedances:

Code
# fit the multivariate logistic regression model
ab_ecoli_dry_model <- glm(ecoli_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
summary(ab_ecoli_dry_model)

Call:
glm(formula = ecoli_exceed ~ log_discharge + daily_rain, family = binomial, 
    data = ab_beach_data_dry_72hrs)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -4.7498     0.3799 -12.503   <2e-16 ***
log_discharge   0.5513     0.4400   1.253    0.210    
daily_rain     63.6080    51.5227   1.235    0.217    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 162.29  on 1240  degrees of freedom
Residual deviance: 159.74  on 1238  degrees of freedom
  (959 observations deleted due to missingness)
AIC: 165.74

Number of Fisher Scoring iterations: 7
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -4.7498     0.3799 -12.503   <2e-16 ***
# log_discharge   0.5513     0.4400   1.253    0.210    
# daily_rain     63.6080    51.5227   1.235    0.217    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 162.29  on 1240  degrees of freedom
# Residual deviance: 159.74  on 1238  degrees of freedom
#   (959 observations deleted due to missingness)
# AIC: 165.74
# 
# Number of Fisher Scoring iterations: 7

Again, nothing is statistically significant for ecoli exceedances. In the wet model, only discharge was significant.

For only total coliform exceedances:

Code
# fit the multivariate logistic regression model
ab_total_coliforms_dry_model <- glm(total_coliforms_exceed ~ log_discharge + daily_rain, 
                    data = ab_beach_data_dry_72hrs, 
                    family = binomial)
summary(ab_total_coliforms_dry_model)

Call:
glm(formula = total_coliforms_exceed ~ log_discharge + daily_rain, 
    family = binomial, data = ab_beach_data_dry_72hrs)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.1931     0.1791 -17.828  < 2e-16 ***
log_discharge   0.7359     0.2044   3.600 0.000319 ***
daily_rain     -0.7220    45.1669  -0.016 0.987246    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 549.71  on 1240  degrees of freedom
Residual deviance: 538.60  on 1238  degrees of freedom
  (959 observations deleted due to missingness)
AIC: 544.6

Number of Fisher Scoring iterations: 5
Code
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    -3.1931     0.1791 -17.828  < 2e-16 ***
# log_discharge   0.7359     0.2044   3.600 0.000319 ***
# daily_rain     -0.7220    45.1669  -0.016 0.987246    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 549.71  on 1240  degrees of freedom
# Residual deviance: 538.60  on 1238  degrees of freedom
#   (959 observations deleted due to missingness)
# AIC: 544.6
# 
# Number of Fisher Scoring iterations: 5

Wow, here the discharge is significant for total coliform exceedances, but daily rain is not. This is the same outcome as the wet model. It would be good to consider why total coliforms might still be significant in the dry model, but not the other bacteria types other than any exceedances.

When there is discharge, it increases the log odds of a total coliform exceedance by 0.7359. When there is daily rain, it decreases the log odds of a total coliform exceedance by 0.7220.

Model Diagnostics